Re: [tesseract-ocr] How to regenerate the training text

2017-06-15 Thread ShreeDevi Kumar
> Where are these scripts, or how can I otherwise generate training text
> from dictionary/corpus data?

These are (most probably) internal scripts at Google which have not been
open sourced.

Please see
https://groups.google.com/forum/#!searchin/tesseract-ocr/training$20text%7Csort:date/tesseract-ocr/-B0mWBwki5w/zuR4R6AGAgAJ
which has Ray's comments regarding the generation of training text.
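
Until something official is available, one rough way to get a training text out of a word-frequency list is to sample words in proportion to their counts and join them into lines. The sketch below is only an illustration under stated assumptions: it is not the method used for the original training, `make_training_text` and the sample counts are hypothetical, bigram data is ignored, and for chi_sim the words are joined without spaces.

```python
import random


def make_training_text(freqs, num_lines=5, words_per_line=8, sep=" ", seed=0):
    """Sample words in proportion to their counts and join them into lines.

    freqs: dict mapping word -> count (hypothetical unigram data).
    sep:   separator between words; use "" for non-space-delimited scripts.
    """
    rng = random.Random(seed)  # fixed seed so the output is repeatable
    words = list(freqs)
    weights = [freqs[w] for w in words]
    lines = []
    for _ in range(num_lines):
        picked = rng.choices(words, weights=weights, k=words_per_line)
        lines.append(sep.join(picked))
    return "\n".join(lines)


if __name__ == "__main__":
    sample = {"中国": 50, "我们": 30, "文字": 15, "识别": 5}  # made-up counts
    print(make_training_text(sample, num_lines=3, words_per_line=6, sep=""))
```

Frequent words show up more often in the result, which is roughly what a unigram list encodes; whether that is enough for good training is something you would need to test.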

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Thu, Jun 15, 2017 at 7:05 PM, Dingyuan Wang wrote:

> Dear all,
>
> I'm trying to generate a training text (chi_sim) for training tesseract
> because I have a better dictionary and unigram/bigram data than the
> defaults. I've found the following comments in
> training/language-specific.sh
>
> (line 845)
> # Set language-specific values for several global variables, including
> #   ${TEXT_CORPUS}
> #  holds the text corpus file for the language, used in phase F
> #   ${FONTS[@]}
> #  holds a sequence of applicable fonts for the language, used in
> #  phase F & I. only set if not already set, i.e. from command line
> #   ${TRAINING_DATA_ARGUMENTS}
> #  non-default arguments to the training_data program used in phase T
> #   ${FILTER_ARGUMENTS} -
> #  character-code-specific filtering to distinguish between scripts
> #  (eg. CJK) used by filter_borbidden_characters in phase F
> #   ${WORDLIST2DAWG_ARGUMENTS}
> #  specify fixed length dawg generation for non-space-delimited lang
> # TODO(dsl): We can refactor these into functions that assign FONTS,
> # TEXT_CORPUS, etc. separately.
>
> So I suppose there are scripts called training_data (phase T)
> and filter_borbidden_characters (sic, phase F) to create the training
> text from some wordlists and unigram/bigram frequency data.
>
> Where are these scripts, or how can I otherwise generate training text
> from dictionary/corpus data?
>
> Thanks.
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To post to this group, send email to tesseract-ocr@googlegroups.com.
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit https://groups.google.com/d/
> msgid/tesseract-ocr/9a5c68ce-43d5-449e-81c1-ff7237133053%
> 40googlegroups.com
> 
> .
> For more options, visit https://groups.google.com/d/optout.
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduVn2655mukTEFmx0%3DVhfLMtdvVxY3Lx%2B%3DYW-o6HuqG_LQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Re: [tesseract-ocr] traineddata file size too small, error clue ?

2017-06-14 Thread ShreeDevi Kumar
Traineddata size depends on many things, not just the number of images.

If your unicharset and number of fonts haven't changed, then the size may be
similar.

The traineddata file also contains the wordlists, so if you are using a
smaller wordlist than the one in the original eng.traineddata, the size
may be smaller.

You can also try the latest version from
https://github.com/UB-Mannheim/tesseract/wiki
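
To check whether every page of the TIFF actually contributes, the APPLY_BOXES lines in the box.train log can be added up. This is a minimal sketch, assuming the log has been saved as text; `summarize_log` is a hypothetical helper, and the patterns only match the message wording that appears in the log quoted in this thread.

```python
import re

PAGE_RE = re.compile(r"Page (\d+) of (\d+)")
READ_RE = re.compile(r"Boxes read from boxfile:\s*(\d+)")
FAILED_RE = re.compile(r"Boxes failed resegmentation:\s*(\d+)")


def summarize_log(log_text):
    """Return (pages_seen, boxes_read, boxes_failed) totals for a training log."""
    pages = boxes_read = boxes_failed = 0
    for line in log_text.splitlines():
        if PAGE_RE.search(line):
            pages += 1
        m = READ_RE.search(line)
        if m:
            boxes_read += int(m.group(1))
        m = FAILED_RE.search(line)
        if m:
            boxes_failed += int(m.group(1))
    return pages, boxes_read, boxes_failed


if __name__ == "__main__":
    sample = """Page 1 of 12
    Boxes read from boxfile:  1671
Page 2 of 12
    Boxes read from boxfile:  2003
Page 3 of 12
    Boxes read from boxfile:  2128
    Boxes failed resegmentation:   2
"""
    print(summarize_log(sample))
```

If the page count and box totals match what you expect, all pages were read, and the file size is then more likely explained by the unicharset and wordlists than by missing pages.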

ShreeDevi

On Wed, Jun 14, 2017 at 11:39 PM, Andres  wrote:

> Dear all,
>
> I've been training tesseract with a multipage tiff file with 5 pages and
> approx 12000 boxes.
>
> Now I increased the samples in the tiff file, I have 12 pages and 29241
> boxes.
>
> My concern is that my previous traineddata file size is 321817 bytes and
> the new one is 318022 bytes. I don't know if it should be bigger, as I have
> no idea about the file format, but I downloaded one version
> of eng.traineddata from the tesseract repository and I see that its size is
> 21876572 bytes. Could it be that perhaps it is computing just the results
> of the first page ? I see in the log that at least, at the beginning of the
> process, it is processing all the pages.
>
> I am using Tesseract 3.02 on Windows.
>
> I will paste my log here, and below that, my batch file, the one that I
> use for training.
>
> Log:
>
> A:\training>tesseract.exe patentesar.normal.exp0.tif patentesar.normal.exp0 nobatch box.train.stderr
> Tesseract Open Source OCR Engine v3.02 with Leptonica
> Page 1 of 12
> row xheight=88.6667, but median xheight = 59.6
> row xheight=81.8333, but median xheight = 59.6
> row xheight=75, but median xheight = 59.6
> row xheight=71.1875, but median xheight = 59.6
> row xheight=71.1875, but median xheight = 59.6
> row xheight=71.1875, but median xheight = 59.6
> row xheight=68.5333, but median xheight = 59.6
> row xheight=67., but median xheight = 59.6
> APPLY_BOXES:
>    Boxes read from boxfile:  1671
>    Found 1671 good blobs.
> TRAINING ... Font name = normal
> Generated training data for 52 words
> Page 2 of 12
> APPLY_BOXES:
>    Boxes read from boxfile:  2003
>    Found 2003 good blobs.
> Generated training data for 58 words
> Page 3 of 12
> FAIL!
> APPLY_BOXES: boxfile line 358/0 ((383,4901),(428,4980)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 529/D ((146,4401),(187,4480)): FAILURE! Couldn't find a matching blob
> APPLY_BOXES:
>    Boxes read from boxfile:  2128
>    Boxes failed resegmentation:  2
>    Found 2126 good blobs.
> Generated training data for 60 words
> Page 4 of 12
> APPLY_BOXES:
>    Boxes read from boxfile:  2257
>    Found 2257 good blobs.
> Generated training data for 62 words
> Page 5 of 12
> APPLY_BOXES:
>    Boxes read from boxfile:  2381
>    Found 2381 good blobs.
> Generated training data for 64 words
> Page 6 of 12
> FAIL!
> APPLY_BOXES: boxfile line 2070/D ((2141,967),(2182,1037)): FAILURE! Couldn't find a matching blob
> APPLY_BOXES:
>    Boxes read from boxfile:  2460
>    Boxes failed resegmentation:  1
>    Found 2459 good blobs.
> Generated training data for 65 words
> Page 7 of 12
> FAIL!
> APPLY_BOXES: boxfile line 2082/B ((867,1084),(910,1151)): FAILURE! Couldn't find a matching blob
> APPLY_BOXES:
>    Boxes read from boxfile:  2568
>    Boxes failed resegmentation:  1
>    Found 2567 good blobs.
> Generated training data for 67 words
> Page 8 of 12
> APPLY_BOXES:
>    Boxes read from boxfile:  2680
>    Found 2680 good blobs.
> Generated training data for 68 words
> Page 9 of 12
> FAIL!
> APPLY_BOXES: boxfile line 2391/D ((1184,910),(1220,973)): FAILURE! Couldn't find a matching blob
> APPLY_BOXES:
>    Boxes read from boxfile:  2818
>    Boxes failed resegmentation:  1
>    Found 2817 good blobs.
> Generated training data for 70 words
> Page 10 of 12
> FAIL!
> APPLY_BOXES: boxfile line 1248/0 ((1468,3440),(1502,3501)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 2211/0 ((342,1491),(382,1550)): FAILURE! Couldn't find a matching blob
> APPLY_BOXES:
>    Boxes read from boxfile:  3000
>    Boxes failed resegmentation:  2
>    Found 2998 good blobs.
> Generated training data for 73 words
> Page 11 of 12
> FAIL!
> APPLY_BOXES: boxfile line 1280/6 ((2054,3645),(2087,3702)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 2750/0 ((496,1051),(528,1105)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 3098/D ((2229,530),(2254,583)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 3347/Q ((1167,90),(1197,142)): FAILURE! Couldn't find a matching blob
> APPLY_BOXES:
>    Boxes read from boxfile:  3370
>    Boxes failed resegmentation:  4
>    Found 3366 good blobs.
> Generated training data for 77 words
> Page 12 of 12
> row 

Re: [tesseract-ocr] oem Detection

2017-06-14 Thread ShreeDevi Kumar
Check that the file is there:

ls -l /home/ibr/tesstutorial/impact_from_full/jpn.lstm

ShreeDevi

On Wed, Jun 14, 2017 at 7:28 PM, Ibr  wrote:

> Yes, I already extracted the lstm file and passed it in the
> --continue_from argument:
>
> --continue_from ~/tesstutorial/impact_from_full/jpn.lstm
>
> Shouldn't this step be enough? Yet the error keeps coming:
>
> Loaded file /home/ibr/tesstutorial/impact_from_full/jpn.lstm, unpacking...
> Failed to continue from: /home/ibr/tesstutorial/impact_from_full/jpn.lstm
>
> Thanks for the response
>


Re: [tesseract-ocr] Re: Font List

2017-06-14 Thread ShreeDevi Kumar
> what is the difference between the engtrain and engeval?

It will depend on what fonts and training text you use for each.

One is used for training; the other is used to evaluate the training.

ShreeDevi

On Wed, Jun 14, 2017 at 5:58 PM, Ibr  wrote:

> UPDATE
>
> I figured out how to use the list, and it seems the two commands are the
> same, so the question remains: what is the difference between engtrain and
> engeval?
>


Re: [tesseract-ocr] oem Detection

2017-06-14 Thread ShreeDevi Kumar
You need to extract the .lstm file from the traineddata.

e.g. (change the folder names to match your setup)

combine_tessdata -e  ../tessdata/jpn.traineddata jpn.lstm
Extracting tessdata components from ../tessdata/jpn.traineddata
Wrote jpn.lstm
0:config:size=2573, offset=168
1:unicharset:size=280627, offset=2741
2:unicharambigs:size=4676, offset=283368
3:inttemp:size=30618346, offset=288044
4:pffmtable:size=36561, offset=30906390
5:normproto:size=452735, offset=30942951
6:punc-dawg:size=2602, offset=31395686
7:word-dawg:size=1007922, offset=31398288
8:number-dawg:size=42, offset=32406210
9:freq-dawg:size=1146, offset=32406252
13:shapetable:size=664546, offset=32407398
16:params-model:size=699, offset=33071944
17:lstm:size=10299009, offset=33072643
18:lstm-punc-dawg:size=2602, offset=43371652
19:lstm-word-dawg:size=1005930, offset=43374254
20:lstm-number-dawg:size=50, offset=44380184
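
The listing printed by `combine_tessdata -e` can also be read programmatically, for example to see how large the lstm component is or to confirm that the components sit back to back in the file. This is a small sketch that assumes the `index:name:size=N, offset=M` line format shown above; `parse_components` and `is_contiguous` are hypothetical helper names.

```python
import re

ENTRY_RE = re.compile(r"^(\d+):([\w-]+):size=(\d+), offset=(\d+)$")


def parse_components(listing):
    """Parse 'index:name:size=N, offset=M' lines into a list of dicts."""
    comps = []
    for line in listing.splitlines():
        m = ENTRY_RE.match(line.strip())
        if m:
            idx, name, size, offset = m.groups()
            comps.append({"index": int(idx), "name": name,
                          "size": int(size), "offset": int(offset)})
    return comps


def is_contiguous(comps):
    """True if each component starts exactly where the previous one ends."""
    return all(a["offset"] + a["size"] == b["offset"]
               for a, b in zip(comps, comps[1:]))


if __name__ == "__main__":
    listing = """0:config:size=2573, offset=168
1:unicharset:size=280627, offset=2741
17:lstm:size=10299009, offset=33072643
18:lstm-punc-dawg:size=2602, offset=43371652"""
    comps = parse_components(listing)
    for c in comps:
        print(c["index"], c["name"], c["size"])
```

Lines that do not match the pattern (such as "Wrote jpn.lstm") are simply skipped.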


ShreeDevi

On Wed, Jun 14, 2017 at 6:45 PM, Ibr  wrote:

> Is this command correct to create the intermediate .lstm and _checkpoint
> files?
>
> training/lstmtraining --model_output ~/tesstutorial/impact_from_small/impact \
>   --train_listfile ~/tesstutorial/jpntrain/jpn.training_files.txt \
>   --continue_from ~/tesstutorial/impact_from_full/jpn.lstm
>
> As for --continue_from, it is mentioned that it can be a recognition
> model, which would be a .lstm file. If not, what is the existing model?
> When I run the command above, it says:
>
> Loaded file /home/ibr/tesstutorial/impact_from_full/jpn.traineddata, unpacking...
> Failed to continue from: /home/ibr/tesstutorial/impact_from_full/jpn.traineddata
>
>

Re: [tesseract-ocr] oem Detection

2017-06-13 Thread ShreeDevi Kumar
combine_tessdata -e

extracts the lstm file from the traineddata provided from the original
training by Google.

---

> if I use the tesstrain.sh it will create .lstmf files

Yes. These are created from the box-tiff pairs generated from the training
text and fonts.

---

The lstmtraining program takes all of these .lstmf files (via the list file
that contains all the .lstmf filenames) and creates intermediate .lstm files
and _checkpoint files.

---

These can be converted to the final .lstm file for use in traineddata.

---

The final .lstm file has to be combined using combine_tessdata to create the
new traineddata.
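
The sequence above can be sketched as four command lines, built here as argument lists so they can be inspected before anything is run. This is only a sketch under assumptions: the helper names and all paths are hypothetical, and the `--stop_training` flag used for converting a checkpoint into the final .lstm is assumed from the 4.00 training tutorial, so please check it against `lstmtraining --help` for your build.

```python
def extract_lstm_cmd(traineddata, out_lstm):
    # Step 1: pull the existing LSTM model out of a traineddata file.
    return ["combine_tessdata", "-e", traineddata, out_lstm]


def train_cmd(model_output, train_listfile, continue_from):
    # Step 2: continue training from the extracted model, using the file
    # that lists all the .lstmf files.
    return ["lstmtraining",
            "--model_output", model_output,
            "--train_listfile", train_listfile,
            "--continue_from", continue_from]


def finalize_cmd(checkpoint, final_lstm):
    # Step 3 (assumed flag): convert a _checkpoint file to the final .lstm.
    return ["lstmtraining", "--stop_training",
            "--continue_from", checkpoint,
            "--model_output", final_lstm]


def combine_cmd(traineddata, final_lstm):
    # Step 4: write the final .lstm into the traineddata. This overwrites
    # the component inside that file, so work on a copy.
    return ["combine_tessdata", "-o", traineddata, final_lstm]


if __name__ == "__main__":
    steps = [
        extract_lstm_cmd("tessdata/jpn.traineddata", "work/jpn.lstm"),
        train_cmd("work/impact", "work/jpn.training_files.txt", "work/jpn.lstm"),
        finalize_cmd("work/impact_checkpoint", "work/new/jpn.lstm"),
        combine_cmd("work/new/jpn.traineddata", "work/new/jpn.lstm"),
    ]
    for argv in steps:
        print(" ".join(argv))
```

Note that combine_tessdata appears to decide which component a file is from its suffix, so the file passed in step 4 should end in .lstm and not .lstmf.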


ShreeDevi

On Tue, Jun 13, 2017 at 6:09 PM, Ibr  wrote:

> Thanks for the response. Actually I wrote the command wrong; I wanted
> to combine. I also didn't extract the lstm file before doing the
> combination, which brings up another question.
>
> If I use tesstrain.sh it will create .lstmf files, correct? But if I
> use combine_tessdata -e, that will create an lstm file, so what is the
> difference between the two?
> I know that .lstmf files are a substitute for the .tr files. If you could
> give me a little explanation of both I would be grateful, since there is
> not much explanation of them on the web.
>
> Thanks in advance
>
>
Re: [tesseract-ocr] oem Detection

2017-06-13 Thread ShreeDevi Kumar
You have to be clear about which files you are combining.

The command you have given overwrites the Japanese traineddata. Is that
what you want to do?

> training/combine_tessdata -o tessdata/jpn.traineddata

Look at the help for all options of combine_tessdata.

Figure out which files (lstm, dawg, etc.) you want to combine.

Give the appropriate command options and files to create the new traineddata.

ShreeDevi

On Tue, Jun 13, 2017 at 5:25 PM, Ibr  wrote:

> Seems so. To add or merge the new LSTM files into the traineddata, is
> this the correct command to use?
>
> training/combine_tessdata -o tessdata/jpn.traineddata ~/tesstutorial/eng_from_chi/.lstm
>
> It gave me the following:
> TessdataManager can't determine which tessdata component is represented by lstmf
> TessdataManager combined tesseract data files.
> Offset for type  0 (.traineddataconfig) is 172
> Offset for type  1 (.traineddataunicharset) is 2745
> Offset for type  2 (.traineddataunicharambigs ) is 283372
> Offset for type  3 (.traineddatainttemp   ) is 288048
> Offset for type  4 (.traineddatapffmtable ) is 30906394
> Offset for type  5 (.traineddatanormproto ) is 30942955
> Offset for type  6 (.traineddatapunc-dawg ) is 31395690
> Offset for type  7 (.traineddataword-dawg ) is 31398292
> Offset for type  8 (.traineddatanumber-dawg   ) is 32406214
> Offset for type  9 (.traineddatafreq-dawg ) is 32406256
> Offset for type 10 (.traineddatafixed-length-dawgs) is -1
> Offset for type 11 (.traineddatacube-unicharset   ) is -1
> Offset for type 12 (.traineddatacube-word-dawg) is -1
> Offset for type 13 (.traineddatashapetable) is 32407402
> Offset for type 14 (.traineddatabigram-dawg   ) is -1
> Offset for type 15 (.traineddataunambig-dawg  ) is -1
> Offset for type 16 (.traineddataparams-model  ) is 33071948
> Offset for type 17 (.traineddatalstm  ) is 33072647
> Offset for type 18 (.traineddatalstm-punc-dawg) is 43371656
> Offset for type 19 (.traineddatalstm-word-dawg) is 43374258
> Offset for type 20 (.traineddatalstm-number-dawg  ) is 44380188
>
> any idea?
> thanks
>
>

Re: [tesseract-ocr] oem Detection

2017-06-13 Thread ShreeDevi Kumar
tesseract image results -l ara --tessdata-dir ./tessdata --oem 1

uses the LSTM files that are in ara.traineddata in your tessdata
directory.

Just placing lstm files in the tesseract folder is not going to change
anything.

You need to create a new traineddata with the new lstm files and then test
with it.
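
To make the lookup concrete: the language given with -l and the directory given with --tessdata-dir together name one file, and that is where the LSTM model is read from. A tiny sketch, with `traineddata_path` as a hypothetical helper:

```python
from pathlib import Path


def traineddata_path(tessdata_dir, lang):
    """Path of the single <lang>.traineddata file used for a language."""
    return Path(tessdata_dir) / f"{lang}.traineddata"


if __name__ == "__main__":
    path = traineddata_path("./tessdata", "ara")
    print(path, "exists" if path.exists() else "is missing")
```

Loose .lstm files lying elsewhere are not consulted, which would explain why deleting them made no difference.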

ShreeDevi

On Tue, Jun 13, 2017 at 3:17 PM, Ibr  wrote:

> Hi,
>
> When I run detection using tesseract 4.00.00alpha I use the command:
>
> tesseract image results -l ara --tessdata-dir ./tessdata --oem 1
>
> The oem here means "Neural nets LSTM only", but there is no argument in
> tesseract to specify where to find the LSTM files, so how does tesseract
> find them? I used to place the LSTM files inside the tesseract folder, but
> I tried to detect after deleting the LSTM files, with the argument
> --oem 1 (which means LSTM only), and the detection still happened. Does
> tesseract search other folders for LSTM files? I had LSTM files in
> different folders.
>
> Thanks.
>


Re: [tesseract-ocr] Tesseract on Bitmap images giving error - Error: "Failed to create pix, this normally occurs because...

2017-06-12 Thread ShreeDevi Kumar
Hari,

Please also look in the leptonica program directory for pdf2tiff,
pdf2mtiff, etc.



Re: [tesseract-ocr] Tesseract on Bitmap images giving error - Error: "Failed to create pix, this normally occurs because...

2017-06-12 Thread ShreeDevi Kumar
Thanks, Dan.

Forwarding your message to the group and original poster - who was getting
errors with large bitmaps

> when a bitmap image is created newly, and if the image dimensions are
> exceeding 1900 x 2475, and in the next line when the same bitmap is being
> tried to convert to Pix then at that point of time, I am getting the
> error which I was talking about in the post.
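
If resizing before conversion turns out to be the workaround, the target size can be computed so the aspect ratio is kept. The sketch below is in Python purely for illustration (the original code is .NET); `fit_within` is a hypothetical helper, and the 1900 x 2475 bound is only the threshold reported by the poster, not a documented Tesseract or leptonica limit.

```python
def fit_within(width, height, max_width=1900, max_height=2475):
    """Scale (width, height) down, keeping aspect ratio, to fit the bounds.

    Images that already fit are returned unchanged (never scaled up).
    """
    scale = min(max_width / width, max_height / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


if __name__ == "__main__":
    for size in [(3800, 4950), (1000, 1000), (4000, 2000)]:
        print(size, "->", fit_within(*size))
```

Rendering the PDF page at a lower DPI in the first place would have a similar effect and avoids a second resampling step.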


On Mon, Jun 12, 2017 at 7:52 PM, Dan Bloomberg <dan.bloomb...@gmail.com>
wrote:

> >> BitmapToPixConverter b = new BitmapToPixConverter();
> >> Pix pix = b.Convert(bitmap);
>
> This is not leptonica code. It shouldn't compile, with b being a ptr
> that is dereferenced with a ".". This is then set equal to a pix which is
> (as written) not a ptr either, causing a copy if it were correct.
>
>
> On Mon, Jun 12, 2017 at 12:16 AM, ShreeDevi Kumar <shreesh...@gmail.com>
> wrote:
>
>> image processing within tesseract is done by leptonica.
>>
>> https://github.com/DanBloomberg/leptonica
>>
>> + dan bloomberg
>>
>>
>>
>> ShreeDevi
>>
>> On Mon, Jun 12, 2017 at 11:25 AM, Hari.K <harik...@gmail.com> wrote:
>>
>>> Thanks Shree.
>>>
>>> Hello Quan,
>>>
>>> Here are my further updates / observations on the post :
>>>
>>> - The error which I had mentioned in this post is actually occurring in
>>> the below yellow highlighted line.
>>> - As per my analysis, when a bitmap image is created newly, and if the
>>> image dimensions are exceeding *1900 x 2475*, and in the next line when
>>> the same bitmap is being tried to convert to *Pix *then at that point
>>> of time, I am getting the error which I was talking about in the post.
>>>
>>>
>>> for (int i = 0; i <= document.Pages.Count; i++)
>>> {
>>>     bitmap = (Bitmap)document.SaveAsImage(i, PdfImageType.Bitmap, 200, 200);
>>>
>>>     BitmapToPixConverter b = new BitmapToPixConverter();
>>>     Pix pix = b.Convert(bitmap);
>>>     .
>>> }
>>> So as far as I understand, Tesseract is not able to convert the bitmap
>>> because it has larger dimensions, and it throws the error we are
>>> talking about in this post.
>>>
>>> Can anyone confirm that Tesseract has this kind of limitation when
>>> converting a bitmap of larger dimensions?
>>>
>>> Is the only way to get rid of this issue to resize the bitmap image
>>> before I try to convert it to Pix? Am I going in the right direction? Any
>>> other ideas, please?
>>>
>>> Thanks in Advance,
>>> Hari
>>>
>>> On Friday, 9 June 2017 11:59:08 UTC+5:30, shree wrote:
>>>>
>>>> + quan
>>>>
>>>> Quan will be better able to advise regarding .NET
>>>>
>>>> also see https://sourceforge.net/projects/vietocr/files/vietocr.net/5.0alpha/
>>>>
>>>> ShreeDevi
>>>> 
>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>
>>>> On Fri, Jun 9, 2017 at 10:44 AM, Hari.K <hari...@gmail.com> wrote:
>>>>
>>>>> Thank you Shree for replying back on the issue. Yes I know about
>>>>> ghostscript and its commands, but with the present architecture of project
>>>>> we are restricted to accommodate the ghostscript commands. Besides, I am also
>>>>> aware of "gsdll32.dll", but as it is not a .Net managed library, and we
>>>>> can't reference it directly in a project and moreover we will have to go by
>>>>> the PInvoke procedure, hence for all those above reasons and limitations we
>>>>> are supposed to stay away from ghostscript.
>>>>>
>>>>> Do you think we have any better alternative libraries which I can make
>>>>> use of so that I would not be getting that error which I mentioned in this
>>>>> post ?
>>>>>
>>>>> Thanks in Advance,
>>>>> Hari
>>>>>

Re: [tesseract-ocr] Detect Multiple Images by Command Line

2017-06-12 Thread ShreeDevi Kumar
see  https://github.com/tesseract-ocr/tesseract/issues/928
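
One possible approach, as a sketch only: tesseract accepts a plain-text file that lists one image path per line as its input and writes all pages to a single output file. The file names below (image1.png, images.txt) and the language code eng are assumptions for illustration, not taken from the original question.

```shell
#!/bin/sh
# Build a list file with one image path per line (file names are hypothetical).
make_image_list() {
    list_file=$1
    shift
    : > "$list_file"                 # create or truncate the list file
    for img in "$@"; do
        printf '%s\n' "$img" >> "$list_file"
    done
}

make_image_list images.txt image1.png image2.png image3.png

# Pass the list file instead of a single image; all pages go to result.txt.
# Guarded so the sketch is a no-op where tesseract or the images are absent.
if command -v tesseract >/dev/null 2>&1 && [ -f image1.png ]; then
    tesseract images.txt result -l eng --tessdata-dir ./tessdata --oem 1
fi
```

The list file keeps the command line short no matter how many images are processed.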

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Mon, Jun 12, 2017 at 3:58 PM, Ibr  wrote:

> Hi,
>
> To run recognition on an image with tesseract 4.00alpha I use the
> command *tesseract image results -l lang --tessdata-dir ./tessdata --oem
> 1* .
>
> My question is: when I need to process, say, 10 images (image1, image2,
> image3, etc.) in a single command and collect all the results in the
> same output file, "result", what should the command look like?
>
> Thanks
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduXtJzMGsoBe7vowDyoFD_hR51XM6H_CS6zKL%3DbBnB_Vig%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Re: [tesseract-ocr] Tesseract on Bitmap images giving error - Error: "Failed to create pix, this normally occurs because...

2017-06-12 Thread ShreeDevi Kumar
image processing within tesseract is done by leptonica.

https://github.com/DanBloomberg/leptonica

+ dan bloomberg



ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Mon, Jun 12, 2017 at 11:25 AM, Hari.K  wrote:

> Thanks Shree.
>
> Hello Quan,
>
> Here are my further updates / observations on the post :
>
> - The error I mentioned in this post actually occurs on the line marked
> in the code below.
> - As per my analysis, when a newly created bitmap exceeds *1900 x 2475*
> and is then converted to *Pix* on the next line, I get the error I
> described in the post.
>
>
> for (int i = 0; i < document.Pages.Count; i++)
> {
>     bitmap = (Bitmap)document.SaveAsImage(i, PdfImageType.Bitmap, 200, 200);
>
>     BitmapToPixConverter b = new BitmapToPixConverter();
>     Pix pix = b.Convert(bitmap);   // the error is raised here
>     .
> }
> As I understand it, Tesseract cannot convert the generated bitmap because
> its dimensions are too large, and it throws the error discussed in this
> post.
>
> Does Tesseract really have such a limit when converting large bitmaps?
>
> Is resizing the bitmap before converting it to Pix the only way around
> this? Am I going in the right direction, or are there other ideas?
>
> Thanks in Advance,
> Hari
>
> On Friday, 9 June 2017 11:59:08 UTC+5:30, shree wrote:
>>
>> + quan
>>
>> Quan will be better able to advise regarding .net
>>
>> also see https://sourceforge.net/projects/vietocr/files/vietocr.net/5.0alpha/
>>
>> ShreeDevi
>> 
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Fri, Jun 9, 2017 at 10:44 AM, Hari.K  wrote:
>>
>>> Thank you Shree for replying back on the issue. Yes I know about
>>> ghostscript and its commands, but with the present architecture of project
>>> we are restricted to accommodate the ghostscript commands. Besides, I am also
>>> aware of "gsdll32.dll", but as it is not a .Net managed library, and we
>>> can't reference it directly in a project and moreover we will have to go by
>>> the PInvoke procedure, hence for all those above reasons and limitations we
>>> are supposed to stay away from ghostscript.
>>>
>>> Do you think we have any better alternative libraries which I can make
>>> use of so that I would not be getting that error which I mentioned in this
>>> post ?
>>>
>>> Thanks in Advance,
>>> Hari

Re: [tesseract-ocr] Re: What is the "Confidence"value returned by Tesseract and how it is calculated?

2017-06-09 Thread ShreeDevi Kumar
Technical documentation links

https://github.com/tesseract-ocr/tesseract/wiki/Technical-Documentation



Re: [tesseract-ocr] Tesseract on Bitmap images giving error - Error: "Failed to create pix, this normally occurs because...

2017-06-09 Thread ShreeDevi Kumar
+ quan

Quan will be better able to advise regarding .net

also see https://sourceforge.net/projects/vietocr/files/vietocr.net/5.0alpha/

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Fri, Jun 9, 2017 at 10:44 AM, Hari.K  wrote:

> Thank you, Shree, for replying. Yes, I know about ghostscript and its
> commands, but the present architecture of our project does not let us
> accommodate ghostscript commands. I am also aware of "gsdll32.dll", but it
> is not a .NET managed library, so we cannot reference it directly in a
> project and would have to go through P/Invoke. For these reasons we have
> to stay away from ghostscript.
>
> Do you know of any alternative libraries I could use that would avoid the
> error I mentioned in this post?
>
> Thanks in Advance,
> Hari
>
> On Thursday, 8 June 2017 21:16:15 UTC+5:30, shree wrote:
>>
>> Have you tried using ghostscript to convert pdf to tif files instead?
>> Example commands
>>
>> gs   -r600x600 -sDEVICE=tiffg4   -dFirstPage=106  -dLastPage=109 -o
>> ./tulasi/tulasikrishna%00d.tif  "TulasiPuja.pdf"
>>
>> for one tif per page
>>
>> gs   -r600x600 -sDEVICE=tiffg4   -dFirstPage=126  -dLastPage=131 -o
>> ./tulasi/tulasIviShNupUjA.tif  "TulasiPuja.pdf"
>>
>> for multipage tif
>>
>> you can reduce resolution to -r300x300
>>
>> ShreeDevi
>> 
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Thu, Jun 8, 2017 at 7:25 PM, Hari.K  wrote:
>>
>>> Hi There,
>>>
>>> I sometimes receive an error - "Failed to create pix, this normally
>>> occurs because the requested image size is too large, please check Standard
>>> Error Output" when doing OCR on a bitmap image.
>>>
>>>
>>> Below highlighted line is where it's breaking for me -
>>>
>>>  Bitmap bitmap;
>>> Spire.Pdf.PdfDocument document = new Spire.Pdf.PdfDocument(pdfPath);
>>>
>>>
>>> for (int i = 0; i < document.Pages.Count; i++)
>>> {
>>> bitmap = (Bitmap)document.SaveAsImage(i,
>>> PdfImageType.Bitmap, 200, 200); // where 200 is the DPI which I am
>>> setting for a bitmap image
>>> ...
>>> .
>>>
>>> }
>>>
>>> More details on what I am trying to do here:
>>> 1) Uploaded a PDF document which is of hardly 600KB
>>> 2) Iterate through each PDF page and convert it into a BitMap image
>>> 3) Then input this BitMap image to Tesseract for performing OCR
>>>
>>> Please note, I don't get this error often. Any ideas on why this error
>>> as I do not receive this every time ?
>>>
>>> Looking forward for some inputs on this..
>>>
>>> Thanks in Advance,
>>> Hari
>>>
>>>


Re: [tesseract-ocr] Tesseract on Bitmap images giving error - Error: "Failed to create pix, this normally occurs because...

2017-06-08 Thread ShreeDevi Kumar
Have you tried using ghostscript to convert pdf to tif files instead?
Example commands

gs -r600x600 -sDEVICE=tiffg4 -dFirstPage=106 -dLastPage=109 -o \
  ./tulasi/tulasikrishna%00d.tif "TulasiPuja.pdf"

for one tif per page

gs -r600x600 -sDEVICE=tiffg4 -dFirstPage=126 -dLastPage=131 -o \
  ./tulasi/tulasIviShNupUjA.tif "TulasiPuja.pdf"

for multipage tif

you can reduce resolution to -r300x300
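
For a sense of scale, the pixel size of a rendered page is its physical size multiplied by the resolution. The sketch below assumes a US Letter page (8.5 x 11 in), which the thread does not state, and takes the page size in tenths of an inch so that plain integer shell arithmetic is enough.

```shell
#!/bin/sh
# Pixel dimensions of a page rendered at a given DPI.
# Page size is passed in tenths of an inch to keep the arithmetic integer-only.
page_pixels() {
    width_tenths=$1
    height_tenths=$2
    dpi=$3
    echo "$(( width_tenths * dpi / 10 ))x$(( height_tenths * dpi / 10 ))"
}

# US Letter (8.5 x 11 in) is an assumption; substitute the real page size.
for dpi in 200 300 600; do
    echo "$dpi dpi: $(page_pixels 85 110 "$dpi")"
done
```

Under that assumption a page is 1700x2200 pixels at 200 dpi and 5100x6600 at 600 dpi, which is nine times as many pixels.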

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Thu, Jun 8, 2017 at 7:25 PM, Hari.K  wrote:

> Hi There,
>
> I sometimes receive an error - "Failed to create pix, this normally
> occurs because the requested image size is too large, please check Standard
> Error Output" when doing OCR on a bitmap image.
>
>
> Below highlighted line is where it's breaking for me -
>
>  Bitmap bitmap;
> Spire.Pdf.PdfDocument document = new Spire.Pdf.PdfDocument(pdfPath);
>
>
> for (int i = 0; i < document.Pages.Count; i++)
> {
> bitmap = (Bitmap)document.SaveAsImage(i,
> PdfImageType.Bitmap, 200, 200); // where 200 is the DPI which I am
> setting for a bitmap image
> ...
> .
>
> }
>
> More details on what I am trying to do here:
> 1) Uploaded a PDF document which is of hardly 600KB
> 2) Iterate through each PDF page and convert it into a BitMap image
> 3) Then input this BitMap image to Tesseract for performing OCR
>
> Please note, I don't get this error often. Any ideas on why this error as
> I do not receive this every time ?
>
> Looking forward for some inputs on this..
>
> Thanks in Advance,
> Hari
>
>



Re: [tesseract-ocr] Re: How can I convert font data from ver 3.02 to 3.05

2017-06-06 Thread ShreeDevi Kumar
As far as I know, the traineddata files for 3.04 (also usable for 3.05) are
github versions of the files posted on code.google.com for 3.02. So, I
would think 3.02 traineddata files will work with 3.05 but newer files will
not work with 3.02.

Best is to give it a try and report your results.

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Wed, Jun 7, 2017 at 10:16 AM, RND Android 
wrote:

> Sorry I meant 3.02 to 3.05.
>
> Addition: Is there any way that I can use 3.02 font data for tesseract 3.05?
>
> On Wednesday, June 7, 2017 at 10:58:03 AM UTC+7, RND Android wrote:
>>
>> Hi, I have some trained data file for several fonts which successfully
>> used for tesseract ver 5.02, now my company upgrade the tesseract ver to
>> 5.05, so how can I convert those trained data fonts from ver 5.02 to be
>> used on ver 5.05? Please help
>>



Re: [tesseract-ocr] Does any parameter to control ocr region?

2017-06-06 Thread ShreeDevi Kumar
try latest code from
http://www.emgu.com/wiki/index.php/Version_History#Emgu.CV-3.2.0

I converted the bmp to png, tried it with command-line tesseract 4, and got
the correct result.

$ tesseract I.png stdout --oem 1 --psm 6
D


$ tesseract I.png stdout --oem 0 --psm 6
D

original .bmp also works.

$ tesseract I.bmp stdout --oem 0 --psm 6
D

Warning. Invalid resolution 0 dpi. Using 70 instead.


ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, Jun 6, 2017 at 7:04 PM, Duck  wrote:

> I need some help.
>
> The following pic is my problem: it is always recognized as "I".
>
> After tracing for a while, I found that the OCR engine segments the image
> again and takes out the middle area of the "D".
>
> I tried a lot of parameters but can't disable the segmentation process.
>
> Does anyone have any idea? Or is adjusting the image the only way?
>
> Also, is there any condition under which the same tesseract class instance
> gives different results at different times?
>
> With the same pic in my program, I click "start" twice and get different
> results.
>
> But if I create a new tesseract class every time, it does not happen...
>
> please!
>
> (The engine mode is TesseractOnly, the page segmentation mode is SingleChar;
> I use Tesseract 3.04 in EmguCV.)
>



Re: [tesseract-ocr] Re: Italian - Missing special-words

2017-06-05 Thread ShreeDevi Kumar
Yes, it should be there in tessdata like eng.user-words

Please open an issue with details and a link to this thread, so that it
can be added.

Thanks!

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Mon, Jun 5, 2017 at 8:02 PM, John Muccigrosso  wrote:

>
>
> On Monday, June 5, 2017 at 10:07:59 AM UTC-4, shree wrote:
>>
>> File is there in langdata
>>
>> https://github.com/tesseract-ocr/langdata/blob/master/ita/ita.special-words
>>
>> and is referred to in the language config file
>>
>> https://github.com/tesseract-ocr/langdata/blob/master/ita/ita.config
>>
>
> Thanks.
>
> I'm doing this by installing tesseract via homebrew, then keeping a local
> copy of tessdata via github. tessdata doesn't have the special-words file
> (which in this case is only two lines anyway). Perhaps it should?
>



Re: [tesseract-ocr] Re: Italian - Missing special-words

2017-06-05 Thread ShreeDevi Kumar
File is there in langdata

https://github.com/tesseract-ocr/langdata/blob/master/ita/ita.special-words

and is referred to in the language config file

https://github.com/tesseract-ocr/langdata/blob/master/ita/ita.config
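
To see which of these files a local tessdata directory actually contains, a small check like the following may help. It is only a sketch: the helper name is hypothetical and the directory is the one from the error message in this thread.

```shell
#!/bin/sh
# Report whether the support files for a language exist in a tessdata directory.
check_lang_files() {
    dir=$1
    lang=$2
    for suffix in traineddata special-words; do
        if [ -f "$dir/$lang.$suffix" ]; then
            echo "found:   $lang.$suffix"
        else
            echo "missing: $lang.$suffix"
        fi
    done
}

# Directory taken from the reported error; adjust to the local install.
check_lang_files /usr/local/Cellar/tesseract/3.05.00_1/share/tessdata ita
```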



ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Mon, Jun 5, 2017 at 7:29 PM, John Muccigrosso  wrote:

> Checking in on this. It's still occurring for me with italian on OS
> X, Tesseract Open Source OCR Engine v3.05.00 with Leptonic.
>
> Error: failed to load /usr/local/Cellar/tesseract/3.05.00_1/share/tessdata/ita.special-words
>



Re: [tesseract-ocr] Detection Using LSTM Files

2017-06-05 Thread ShreeDevi Kumar
>assume that I have created 20 LSTM files for English for example, each
LSTM file is for a different font, when I make detection against an image
by running the command: *tesseract image results -l eng --tessdata-dir
./tessdata --oem 1* does the tesseract check the image against all LSTM
files, or just take one of them and make detection against it?

The .lstmf files are created per font/image. lstmtraining processes all
of them together to create one .lstm file for the language.

Maybe it keeps the .lstmf files internally. I do not know whether it
checks against just one of them or creates a combined version to use for
recognition.


ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Mon, Jun 5, 2017 at 7:05 PM, ShreeDevi Kumar <shreesh...@gmail.com>
wrote:

> Comments from Ray regarding training text
>
> > For Latin-based languages, the existing model data provided has been
> trained on about 400000 textlines spanning about 4500 fonts
> <https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951>.
> For other scripts, not so many fonts are available, but they have still
> been trained on a similar number of textlines. Instead of taking a few
> minutes to a couple of hours to train, Tesseract 4.00 takes a few *days* to
> a couple of *weeks.*
>
> >The text corpus is from *all* the www, taken several years ago, plus more
> recent data from wiki-something. The text is divided by language
> automatically, so there is a separate stream for each of the
> Devanagari-based languages (as there is for the Latin-based languages) and
> clipped to 1GB for each language. For each language, the text is frequency
> counted and cleaned by multiple methods, and sometimes this cleaning is too
> stringent automatically, or not stringent enough, so forbidden_characters
> and desired_characters are used as a guide in the cleanup process. There
> are other lang-specific numbers like a 1-in-n discard ratio for the
> frequency. For some languages, the amount of data produced at the end is
> very thin.
>
> >The unicharset is extracted from what remains, and the wordlist that is
> published in langdata.
>
> >For the LSTM training, I resorted to using Google's parallel
> infrastructure to render enough text in all the languages.
>
> >However much or little corpus text there is, the rendering process makes
> 50000 chunks of 50 words to render in a different combination of font and
> random degradation, which results in 400000-800000 rendered textlines. The
> words are chosen to approximately echo the real frequency of conjunct
> clusters (characters in most languages) in the source text, while also
> using the most frequent words.
>
> >This process is all done without significant manual intervention, but
> counts of the number of generated textlines indicate when it has gone
> badly, usually due to a lack of fonts, or a lack of corpus text. I recently
> stopped training chr, iku, khm, mya after discovering that I have no
> rendered textlines that contain anything other than digits and punctuation.
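
The chunking step described above can be pictured with a small sketch. It only illustrates splitting text into fixed-size word chunks; it is not the internal tooling, the function name is hypothetical, and the frequency-based word selection is not modeled.

```shell
#!/bin/sh
# Split whitespace-separated words from stdin into lines of N words each.
chunk_words() {
    n=$1
    tr -s '[:space:]' '\n' | awk -v n="$n" '
        NF { printf "%s%s", (count % n ? " " : ""), $0   # space between words
             count++
             if (count % n == 0) printf "\n" }          # chunk is full
        END { if (count % n) printf "\n" }'             # close a partial chunk
}

# 120 dummy words: two full chunks of 50 words and one chunk of 20.
awk 'BEGIN { for (i = 1; i <= 120; i++) print i }' | chunk_words 50 |
    awk '{ print NF " words" }'
```

Each output line would then be rendered with its own combination of font and degradation.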
>
> ShreeDevi
> 
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Mon, Jun 5, 2017 at 4:59 PM, Ibr <ibr.ham...@gmail.com> wrote:
>
>> Hi,
>>
>> assume that I have created 20 LSTM files for English for example, each
>> LSTM file is for a different font, when I make detection against an image
>> by running the command: *tesseract image results -l eng --tessdata-dir
>> ./tessdata --oem 1* does the tesseract check the image against all LSTM
>> files, or just take one of them and make detection against it?
>>
>> I'm assuming that to make the detection is more accurate I should create
>> many LSTM files for different fonts, because images can be with different
>> fonts from each other so in this way it would be more accurate since I have
>> LSTM file for every possible font, is that correct?
>>
>> Thanks
>>

Re: [tesseract-ocr] Same Font with Multible Styles

2017-06-01 Thread ShreeDevi Kumar
text2image --list_available_fonts --fonts_dir /mnt/c/Windows/Fonts

replace the fonts directory with your fonts location

eg.

633: Times New Roman,
634: Times New Roman, Bold
635: Times New Roman, Bold Italic
636: Times New Roman, Italic
637: Trajan Pro
638: Trajan Pro Bold
639: Trebuchet MS
640: Trebuchet MS Bold
641: Trebuchet MS Bold Italic
642: Trebuchet MS Italic
643: Tungsten
644: Tw Cen MT
645: Tw Cen MT Bold
646: Tw Cen MT Bold Italic
647: Tw Cen MT Condensed Extra Bold,
648: Tw Cen MT Condensed,
649: Tw Cen MT Condensed, Bold
650: Tw Cen MT Italic


See if the font list has the font names such as test regular, test bold and
test italic. If so, use those names in your fontlist for training.

If they are all listed as test, then it may not work.
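
To check this quickly, the listing can be filtered for the family name. A sketch, assuming the listing format shown above (`NNN: Family, Style`); the function name is hypothetical.

```shell
#!/bin/sh
# Print the listing entries whose font name starts with the given family.
# Reads a font listing in the "NNN: Name" format on stdin.
styles_for_family() {
    family=$1
    # Drop the leading "NNN: " index, then keep names that begin with the family.
    sed 's/^ *[0-9][0-9]*: *//' | grep -i "^$family" || true
}

# Example with a few lines from the listing above; the output of
# text2image --list_available_fonts could be piped in instead.
styles_for_family "Times New Roman" <<'EOF'
633: Times New Roman,
634: Times New Roman, Bold
637: Trajan Pro
EOF
```

If the family shows up with distinct style suffixes, those full names are the ones to put in the font list.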

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Thu, Jun 1, 2017 at 6:36 PM, Ibr  wrote:

> Hi,
>
> Suppose we have a set of font files that all belong to the same font,
> each for a different style. For example, for a font "test" there is one
> file for test regular, one for test bold and one for test italic, but all
> of these styles share the same font name, "test". If I install them all on
> the machine and create a single LSTM file for the font test, will
> tesseract create the LSTM file for all styles or just one of them?
>
> Keep in mind I can't train for every style separately, since all styles
> have the same name "test" and are already installed on the machine.
>
> Thanks
>



Re: [tesseract-ocr] How to add space between strings of document. Punjabi (Gurmukhi) language have a space issue, after ocr the image it is showing no space b/w the text.

2017-06-01 Thread ShreeDevi Kumar
Read https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00

Follow the tutorials.



Re: [tesseract-ocr] How to add space between strings of document. Punjabi (Gurmukhi) language have a space issue, after ocr the image it is showing no space b/w the text.

2017-06-01 Thread ShreeDevi Kumar
Are you training for 3.0 or 4.0?

Do you have spaces between the letters in your training text?

Read https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
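
A rough way to check the question above about spaces is to scan the training
text for lines that contain no space at all. The Python below is only a
sketch: the function name is made up, and the two sample lines are the ones
reported in this thread. A line holding a single word is legitimately
space-free, so treat hits as hints rather than errors.

```python
def lines_without_spaces(text, min_chars=2):
    """Return (line_number, line) pairs for non-empty lines with no space."""
    flagged = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        # Skip empty lines and lines too short to need a separator.
        if len(stripped) >= min_chars and " " not in stripped:
            flagged.append((number, stripped))
    return flagged


# First line is spaced as desired, second line is the reported OCR output.
sample = "ੳ ਸ ਦ ਡ ਗ\nੳਸਦਡਗ\n"
print(lines_without_spaces(sample))
```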

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Thu, Jun 1, 2017 at 2:48 PM, Mandeep Singh  wrote:

>
> Thank you very much, it is working.
>
>
> But I have more questions.
>
> 1. If I train new data, there is still the space problem.
>
> 2. How do I add more data to pan.traineddata, or can I edit the existing
> traineddata?
>
> On Thursday, 1 June 2017 14:34:14 UTC+5:30, shree wrote:
>>
>> https://github.com/tesseract-ocr/tessdata
>>
>> has the traineddata for 4.0.
>>



Re: [tesseract-ocr] How to add space between strings of document. Punjabi (Gurmukhi) language have a space issue, after ocr the image it is showing no space b/w the text.

2017-06-01 Thread ShreeDevi Kumar
https://github.com/tesseract-ocr/tessdata

has the traineddata for 4.0.



Re: [tesseract-ocr] How to add space between strings of document. Punjabi (Gurmukhi) language have a space issue, after ocr the image it is showing no space b/w the text.

2017-06-01 Thread ShreeDevi Kumar
Please read the wiki links I sent.

If you have installed tesseract 4.0, please test first with the provided
traineddata for Punjabi before trying to train.

Most times, existing traineddata provides the best result.



ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Thu, Jun 1, 2017 at 2:16 PM, Mandeep Singh <mandeep5...@gmail.com> wrote:

> I installed tesseract.exe 4.0 on my system, and I am using
> jTessBoxEditor 2.0 to train data for the Punjabi language. That's it. I
> don't know what LSTM means. Please guide me.
>
> On Thursday, 1 June 2017 14:04:34 UTC+5:30, shree wrote:
>>
>> Are you using the 4.0 version of tesseract with --oem 1 (LSTM engine
>> only)?
>>
>> ShreeDevi
>> 
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Thu, Jun 1, 2017 at 1:13 PM, Mandeep Singh <mande...@gmail.com> wrote:
>>
>>> Kindly look at this issue, or please guide me on how to add a config file
>>> for the Punjabi language.
>>>
>>>
>>> On Thursday, 1 June 2017 11:40:22 UTC+5:30, Mandeep Singh wrote:
>>>>
>>>>
>>>> There is still a space issue. Kindly review this attachment.
>>>>
>>>>
>>>> Please help me out.
>>>>
>>>>
>>>> On Wednesday, 31 May 2017 18:11:10 UTC+5:30, shree wrote:
>>>>>
>>>>> Use --oem 1 (LSTM engine) with tesseract 4.0. You will get correct
>>>>> output.
>>>>>
>>>>> For the command line interface, use the binaries from
>>>>> https://github.com/UB-Mannheim/tesseract/wiki
>>>>>
>>>>> For a GUI, look for the tesseract 4.0 versions of:
>>>>>
>>>>>   gImageReader  https://github.com/manisandro/gImageReader/releases
>>>>>
>>>>>   VietOCR  https://sourceforge.net/projects/vietocr/files/vietocr/5.0alpha/
>>>>>
>>>>>
>>>>>
>>>>> ShreeDevi
>>>>> 
>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>>
>>>>> On Wed, May 31, 2017 at 5:05 PM, ShreeDevi Kumar <shree...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage
>>>>>>
>>>>>> https://github.com/tesseract-ocr/tesseract/wiki
>>>>>>
>>>>>> https://github.com/UB-Mannheim/tesseract/wiki
>>>>>>
>>>>>> https://github.com/manisandro/gImageReader/releases
>>>>>>
>>>>>> ShreeDevi
>>>>>> 
>>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>>>
>>>>>> On Wed, May 31, 2017 at 4:16 PM, Mandeep Singh <mande...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Kindly give me your email address; I want to discuss this
>>>>>>> issue. Yes, I used 3.04. What does PSM mean?
>>>>>>>
>>>>>>> On Wednesday, 31 May 2017 15:54:54 UTC+5:30, shree wrote:
>>>>>>>>
>>>>>>>> Is the output you posted from the 3.04 traineddata in the repo?
>>>>>>>>
>>>>>>>> What PSM did you use?
>>>>>>>>
>>>>>>>> Try using the experimental tesseract4 version for Windows; see the
>>>>>>>> wiki for links.
>>>>>>>>
>>>>>>>> On May 31, 2017 3:47 PM, "Mandeep Singh" <mande...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I am using Windows 8.1 and tesseract version 3.04.
>>>>>>>>>
>>>>>>>>> I am training the data with jTessBoxEditor and, as another method,
>>>>>>>>> with the C# Serak Trainer, but I didn't find any good solution.
>>>>>>>>> The major issue is spacing.
>>>>>>>>>
>>>>>>>>> On Wednesday, 24 May 2017 11:44:42 UTC+5:30, shree wrote:
>>>>>>>>

Re: [tesseract-ocr] How to add space between strings of document. Punjabi (Gurmukhi) language have a space issue, after ocr the image it is showing no space b/w the text.

2017-06-01 Thread ShreeDevi Kumar
Are you using the 4.0 version of tesseract with --oem 1 (LSTM engine only)?

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Thu, Jun 1, 2017 at 1:13 PM, Mandeep Singh <mandeep5...@gmail.com> wrote:

> Kindly look at this issue, or please guide me on how to add a config file
> for the Punjabi language.
>
>
> On Thursday, 1 June 2017 11:40:22 UTC+5:30, Mandeep Singh wrote:
>>
>>
>> There is still a space issue. Kindly review this attachment.
>>
>>
>> Please help me out.
>>
>>
>> On Wednesday, 31 May 2017 18:11:10 UTC+5:30, shree wrote:
>>>
>>> Use --oem 1 (LSTM engine) with tesseract 4.0. You will get correct
>>> output.
>>>
>>> For the command line interface, use the binaries from
>>> https://github.com/UB-Mannheim/tesseract/wiki
>>>
>>> For a GUI, look for the tesseract 4.0 versions of:
>>>
>>>   gImageReader  https://github.com/manisandro/gImageReader/releases
>>>
>>>   VietOCR  https://sourceforge.net/projects/vietocr/files/vietocr/5.0alpha/
>>>
>>>
>>>
>>> ShreeDevi
>>> 
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>
>>> On Wed, May 31, 2017 at 5:05 PM, ShreeDevi Kumar <shree...@gmail.com>
>>> wrote:
>>>
>>>> https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage
>>>>
>>>> https://github.com/tesseract-ocr/tesseract/wiki
>>>>
>>>> https://github.com/UB-Mannheim/tesseract/wiki
>>>>
>>>> https://github.com/manisandro/gImageReader/releases
>>>>
>>>> ShreeDevi
>>>> 
>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>
>>>> On Wed, May 31, 2017 at 4:16 PM, Mandeep Singh <mande...@gmail.com>
>>>> wrote:
>>>>
>>>>> Kindly give me your email address; I want to discuss this
>>>>> issue. Yes, I used 3.04. What does PSM mean?
>>>>>
>>>>> On Wednesday, 31 May 2017 15:54:54 UTC+5:30, shree wrote:
>>>>>>
>>>>>> Is the output you posted from the 3.04 traineddata in the repo?
>>>>>>
>>>>>> What PSM did you use?
>>>>>>
>>>>>> Try using the experimental tesseract4 version for Windows; see the wiki
>>>>>> for links.
>>>>>>
>>>>>> On May 31, 2017 3:47 PM, "Mandeep Singh" <mande...@gmail.com> wrote:
>>>>>>
>>>>>>> I am using Windows 8.1 and tesseract version 3.04.
>>>>>>>
>>>>>>> I am training the data with jTessBoxEditor and, as another method,
>>>>>>> with the C# Serak Trainer, but I didn't find any good solution. The
>>>>>>> major issue is spacing.
>>>>>>>
>>>>>>> On Wednesday, 24 May 2017 11:44:42 UTC+5:30, shree wrote:
>>>>>>>>
>>>>>>>> Which O/S?
>>>>>>>> Which version of Tesseract?
>>>>>>>> How are you training?
>>>>>>>>
>>>>>>>> Have you tried the packaged traineddata for Punjabi? What result do
>>>>>>>> you get with that?
>>>>>>>>
>>>>>>>> ShreeDevi
>>>>>>>> 
>>>>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>>>>>
>>>>>>>> On Wed, May 24, 2017 at 10:14 AM, Mandeep Singh <mande...@gmail.com
>>>>>>>> > wrote:
>>>>>>>>
>>>>>>>>> Hello guys,
>>>>>>>>>
>>>>>>>>> I am training data for the Punjabi language and I am getting a space
>>>>>>>>> issue. How do I edit the config file, and how do I make my own
>>>>>>>>> personal config file for my own custom language? Please assist me.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Output is : ੳਸਦਡਗ
>>>>>>>>> i want and i assume

Re: [tesseract-ocr] Unable to find reference to C++ standard functions when building tesseract 4.00alpha

2017-06-01 Thread ShreeDevi Kumar
Does configure need any change? See earlier messages for details.

>> I can't manage to get an option for ./configure to use g++ instead of
>> gcc. If somebody knows how, I would be grateful.
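
One thing that may be worth trying, as an assumption rather than a confirmed
fix: undefined references to `operator new[]` and to `std::` symbols usually
mean the link step ran without the C++ runtime, and CFLAGS only affects C
sources. Autoconf-generated configure scripts normally accept VAR=value
arguments, so the C++ compiler and the search paths can be given there
instead of on the make line. The Python sketch below only assembles such a
command line; the helper name is made up, and /usr/local is the prefix
mentioned earlier in this thread.

```python
import shlex


def configure_command(prefix="/usr/local", cxx="g++"):
    """Build a ./configure call that names the C++ compiler explicitly."""
    return [
        "./configure",
        f"CXX={cxx}",                    # C++ compiler, also used for linking
        f"CPPFLAGS=-I{prefix}/include",  # preprocessor flags for C and C++
        f"LDFLAGS=-L{prefix}/lib",       # library search path
    ]


# Print the command so it can be pasted into a shell in the source tree.
print(shlex.join(configure_command()))
```

After running the printed command, a plain `make` should pick up the same
settings, so the variables do not need to be repeated there.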



Re: [tesseract-ocr] Unable to find reference to C++ standard functions when building tesseract 4.00alpha

2017-05-31 Thread ShreeDevi Kumar
Supported Compilers

   - GCC 4.8 and above
   - Clang 3.4 and above
   - MSVC 2015, 2017

Other compilers might work, but are not officially supported.

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Wed, May 31, 2017 at 7:01 PM, Youcef  wrote:

> Hi ShreeDevi,
>
> Thanks for your answer.
> I repeated the same steps after pulling the latest code as you suggested,
> but the problems are still present.
>
>
> On Wednesday, 31 May 2017 15:14:07 UTC+2, shree wrote:
>>
>> *git pull origin*
>> to get the latest source. I have built it today without any problems.
>>
>> ShreeDevi
>> 
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Wed, May 31, 2017 at 6:32 PM, Youcef  wrote:
>>
>>> Hi,
>>>
>>> I'm trying to build tesseract from source.
>>> I succeeded in building Leptonica 1.74.1 and installing it into
>>> /usr/local/bin and /usr/local/include.
>>>
>>> In the Tesseract main folder, the first commands run fine:
>>>
>>> ./autogen.sh
>>> ./configure
>>>
>>>
>>> But the problem comes when I run the following command:
>>>
>>> LDFLAGS="-L/usr/local/lib" CFLAGS="-I/usr/local/include" make
>>>
>>>
>>> Compilation goes well, but at the end the link step reports undefined
>>> references such as:
>>>
>>> /home/user/tesseract-ocr/tesseract/api/../ccutil/genericvector.h:659:
>>> undefined reference to `operator new[](unsigned long)'
>>>
>>> and many undefined references to standard C++ functions, such as:
>>>
>>>  ./.libs/libtesseract.so: undefined reference to
>>> `std::basic_ifstream::~basic_ifstream()'
>>>
>>> I have tried other suggested solutions without any success:
>>>
>>> - running LDFLAGS="-L/usr/local/lib" CFLAGS="-I/usr/local/include" make
>>> - removing old tesseract previously installed with apt-get
>>>
>>> Thanks for any help.
>>> Regards
>>>



Re: [tesseract-ocr] Unable to find reference to C++ standard functions when building tesseract 4.00alpha

2017-05-31 Thread ShreeDevi Kumar
*git pull origin*
to get the latest source. I have built it today without any problems.

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Wed, May 31, 2017 at 6:32 PM, Youcef  wrote:

> Hi,
>
> I'm trying to build tesseract from source.
> I succeeded in building Leptonica 1.74.1 and installing it into
> /usr/local/bin and /usr/local/include.
>
> In the Tesseract main folder, the first commands run fine:
>
> ./autogen.sh
> ./configure
>
>
> But the problem comes when I run the following command:
>
> LDFLAGS="-L/usr/local/lib" CFLAGS="-I/usr/local/include" make
>
>
> Compilation goes well, but at the end the link step reports undefined
> references such as:
>
> /home/user/tesseract-ocr/tesseract/api/../ccutil/genericvector.h:659:
> undefined reference to `operator new[](unsigned long)'
>
> and many undefined references to standard C++ functions, such as:
>
>  ./.libs/libtesseract.so: undefined reference to
> `std::basic_ifstream::~basic_ifstream()'
>
> I have tried other suggested solutions without any success:
>
> - running LDFLAGS="-L/usr/local/lib" CFLAGS="-I/usr/local/include" make
> - removing old tesseract previously installed with apt-get
>
> Thanks for any help.
> Regards
>



Re: [tesseract-ocr] How to add space between strings of document. Punjabi (Gurmukhi) language have a space issue, after ocr the image it is showing no space b/w the text.

2017-05-31 Thread ShreeDevi Kumar
Use --oem 1 (LSTM engine) with tesseract 4.0. You will get correct output.

For the command line interface, use the binaries from
https://github.com/UB-Mannheim/tesseract/wiki

For a GUI, look for the tesseract 4.0 versions of:

  gImageReader  https://github.com/manisandro/gImageReader/releases

  VietOCR  https://sourceforge.net/projects/vietocr/files/vietocr/5.0alpha/
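
On the command line, the advice above amounts to something like the call
assembled below. This is a sketch only: page.png and the output base name are
placeholders, the helper name is made up, and "pan" is assumed to be the
language code of the Punjabi traineddata.

```python
import shlex


def lstm_ocr_command(image, output_base, lang="pan"):
    """Build a tesseract call that selects the LSTM engine (--oem 1)."""
    return ["tesseract", image, output_base, "-l", lang, "--oem", "1"]


# Print the command; the recognized text would land in <output_base>.txt.
print(shlex.join(lstm_ocr_command("page.png", "page")))
```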



ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Wed, May 31, 2017 at 5:05 PM, ShreeDevi Kumar <shreesh...@gmail.com>
wrote:

> https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage
>
> https://github.com/tesseract-ocr/tesseract/wiki
>
> https://github.com/UB-Mannheim/tesseract/wiki
>
> https://github.com/manisandro/gImageReader/releases
>
> ShreeDevi
> 
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Wed, May 31, 2017 at 4:16 PM, Mandeep Singh <mandeep5...@gmail.com>
> wrote:
>
>> Kindly give me your email address; I want to discuss this issue.
>> Yes, I used 3.04. What does PSM mean?
>>
>> On Wednesday, 31 May 2017 15:54:54 UTC+5:30, shree wrote:
>>>
>>> Is the output you posted from the 3.04 traineddata in the repo?
>>>
>>> What PSM did you use?
>>>
>>> Try using the experimental tesseract4 version for Windows; see the wiki
>>> for links.
>>>
>>> On May 31, 2017 3:47 PM, "Mandeep Singh" <mande...@gmail.com> wrote:
>>>
>>>> I am using Windows 8.1 and tesseract version 3.04.
>>>>
>>>> I am training the data with jTessBoxEditor and, as another method, with
>>>> the C# Serak Trainer, but I didn't find any good solution. The major
>>>> issue is spacing.
>>>>
>>>> On Wednesday, 24 May 2017 11:44:42 UTC+5:30, shree wrote:
>>>>>
>>>>> Which O/S?
>>>>> Which version of Tesseract?
>>>>> How are you training?
>>>>>
>>>>> Have you tried the packaged traineddata for Punjabi? What result do
>>>>> you get with that?
>>>>>
>>>>> ShreeDevi
>>>>> 
>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>>
>>>>> On Wed, May 24, 2017 at 10:14 AM, Mandeep Singh <mande...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hello guys,
>>>>>>
>>>>>> I am training data for the Punjabi language and I am getting a space
>>>>>> issue. How do I edit the config file, and how do I make my own personal
>>>>>> config file for my own custom language? Please assist me.
>>>>>>
>>>>>>
>>>>>> Output is: ੳਸਦਡਗ
>>>>>> I want, and expect, output like this => ੳ ਸ ਦ ਡ ਗ
>>>>>>

Re: [tesseract-ocr] How to add space between strings of document. Punjabi (Gurmukhi) language have a space issue, after ocr the image it is showing no space b/w the text.

2017-05-31 Thread ShreeDevi Kumar
https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage

https://github.com/tesseract-ocr/tesseract/wiki

https://github.com/UB-Mannheim/tesseract/wiki

https://github.com/manisandro/gImageReader/releases

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Wed, May 31, 2017 at 4:16 PM, Mandeep Singh 
wrote:

> Kindly give me your email address; I want to discuss this issue.
> Yes, I used 3.04. What does PSM mean?
>
> On Wednesday, 31 May 2017 15:54:54 UTC+5:30, shree wrote:
>>
>> Is the output you posted from the 3.04 traineddata in the repo?
>>
>> What PSM did you use?
>>
>> Try using the experimental tesseract4 version for Windows; see the wiki
>> for links.
>>
>> On May 31, 2017 3:47 PM, "Mandeep Singh" wrote:
>>
>>> I am using Windows 8.1 and tesseract version 3.04.
>>>
>>> I am training the data with jTessBoxEditor and, as another method, with
>>> the C# Serak Trainer, but I didn't find any good solution. The major
>>> issue is spacing.
>>>
>>> On Wednesday, 24 May 2017 11:44:42 UTC+5:30, shree wrote:

>>>> Which O/S?
>>>> Which version of Tesseract?
>>>> How are you training?
>>>>
>>>> Have you tried the packaged traineddata for Punjabi? What result do you
>>>> get with that?
>>>>
>>>> ShreeDevi
>>>>
>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>
>>>> On Wed, May 24, 2017 at 10:14 AM, Mandeep Singh
>>>> wrote:
>>>>
>>>>> Hello guys,
>>>>>
>>>>> I am training data for the Punjabi language and I am getting a space
>>>>> issue. How do I edit the config file, and how do I make my own personal
>>>>> config file for my own custom language? Please assist me.
>>>>>
>>>>>
>>>>> Output is: ੳਸਦਡਗ
>>>>> I want, and expect, output like this => ੳ ਸ ਦ ਡ ਗ
>>>>>



Re: [tesseract-ocr] How to add space between strings of document. Punjabi (Gurmukhi) language have a space issue, after ocr the image it is showing no space b/w the text.

2017-05-31 Thread ShreeDevi Kumar
Is the output you posted from the 3.04 traineddata in the repo?

What PSM did you use?

Try using the experimental tesseract4 version for Windows; see the wiki for
links.
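
On the PSM question: PSM is the page segmentation mode, i.e. the assumption
tesseract makes about how text is laid out on the image. A few of the modes
listed by `tesseract --help-psm` in the 4.0 builds are collected below as a
sketch; the `--psm` spelling is the 4.0 form, and the helper name is made up
for illustration.

```python
# A subset of page segmentation modes; the full list has more entries.
PSM_MODES = {
    3: "fully automatic page segmentation, no orientation detection (default)",
    6: "assume a single uniform block of text",
    7: "treat the image as a single text line",
    8: "treat the image as a single word",
    10: "treat the image as a single character",
}


def psm_arguments(mode):
    """Return the command-line arguments that select the given mode."""
    if mode not in PSM_MODES:
        raise ValueError(f"mode {mode} is not in this illustrative subset")
    return ["--psm", str(mode)]


for mode, meaning in PSM_MODES.items():
    print(mode, meaning)
```

For an image that holds one line of characters, mode 7 is a reasonable thing
to try first.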

On May 31, 2017 3:47 PM, "Mandeep Singh"  wrote:

> I am using Windows 8.1 and tesseract version 3.04.
>
> I am training the data with jTessBoxEditor and, as another method, with
> the C# Serak Trainer, but I didn't find any good solution. The major
> issue is spacing.
>
> On Wednesday, 24 May 2017 11:44:42 UTC+5:30, shree wrote:
>>
>> Which O/S?
>> Which version of Tesseract?
>> How are you training?
>>
>> Have you tried the packaged traineddata for Punjabi? What result do you
>> get with that?
>>
>> ShreeDevi
>> 
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Wed, May 24, 2017 at 10:14 AM, Mandeep Singh 
>> wrote:
>>
>>> Hello guys,
>>>
>>> I am training data for the Punjabi language and I am getting a space
>>> issue. How do I edit the config file, and how do I make my own personal
>>> config file for my own custom language? Please assist me.
>>>
>>>
>>> Output is: ੳਸਦਡਗ
>>> I want, and expect, output like this => ੳ ਸ ਦ ਡ ਗ
>>>



Re: [tesseract-ocr] Re: user-words

2017-05-31 Thread ShreeDevi Kumar
Samuel,

Do the user-words work as expected after making this change?

Which version of tesseract are you using?
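
For comparison with the source patch quoted below: recent command-line builds
also accept a `--user-words` option that points at a plain text file with one
word per line, which avoids recompiling. A sketch in Python with made-up file
and helper names; whether the words actually change the result depends on the
engine and version, which is the open question here.

```python
from pathlib import Path
import shlex


def write_user_words(path, words):
    """Write one word per line, dropping blanks; return the words written."""
    cleaned = [w.strip() for w in words if w.strip()]
    Path(path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return cleaned


def user_words_command(image, output_base, words_path, lang="eng"):
    """Build a tesseract call that passes the word list at run time."""
    return ["tesseract", image, output_base, "-l", lang,
            "--user-words", str(words_path)]


# Only prints the command; scan.png and my.user-words are placeholders.
print(shlex.join(user_words_command("scan.png", "scan", "my.user-words")))
```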

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Wed, May 31, 2017 at 2:35 AM, Samuel backus 
wrote:

> I had to recompile tesseract after updating dict.h and dict.cpp for this
> change to take effect.
>
> On Monday, October 3, 2011 at 3:20:05 AM UTC-4, Slavko Kocjancic wrote:
>>
>> On 2.10.2011 1:36, B.J. wrote:
>> > I ran into this problem recently.  Here is the solution (I'm using
>> > Tesseract 3.01):
>> > to use user-words list, in dict.h and dict.cpp, find user_words_suffix
>> > and change the "" to "user-words"
>> > // dict.h
>> > STRING_VAR_H(user_words_suffix, "user-words",
>> >              "A list of user-provided words.");
>> >
>> > // dict.cpp
>> > STRING_INIT_MEMBER(user_words_suffix, "user-words",
>> >                    "A list of user-provided words.",
>> >                    getImage()->getCCUtil()->params()),
>> >
>> > This assumes, then, that in your tessdata folder there is a file named
>> > "eng.user-words" with your user made word list.
>> >
>> > .bj.
>> >
>>
>> I have 3.01 from svn too.
>> Those fields are empty there, so I modified them as you suggest. But I see
>> no difference in OCR. The confidence is still low and the misread word is
>> still misread.
>> If I remove 'eng.user-words', tess just aborts with a missing
>> eng.user-words message, so I assume that the file is opened and
>> used.
>>
>> So is there someone smart enough to explain how 'lang.user-words'
>> works?
>> And another thing: is there someone smart enough to change the source on
>> svn so that this is included, but only checks whether user-words exists
>> instead of popping up an error? (As far as I know lang.user-words is
>> optional, so keep it like that.)
>>
>> Thanks...
>>



Re: [tesseract-ocr] How recognize footnotes

2017-05-30 Thread ShreeDevi Kumar
Try the `hocr` output and see if it provides some of what you need.

I don't think tesseract will link to footnotes though it may recognize the
text.
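
To give an idea of what the hOCR output offers here: it is HTML in which each
recognized line carries a bounding box, so footnote-like lines can be guessed
from position and size. The Python below is a sketch under stated
assumptions. The sample markup is hand-written in the hOCR style rather than
real tesseract output, and the heuristic (lines in the bottom part of the
page that are shorter in height than the typical line) is an illustration,
not something tesseract provides. Names such as footnote_candidates are made
up.

```python
from html.parser import HTMLParser
from statistics import median
import re

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrLines(HTMLParser):
    """Collect (bbox, text) for every span with class 'ocr_line'."""

    def __init__(self):
        super().__init__()
        self.page_height = None
        self.lines = []
        self._words = None   # words of the line being read, or None
        self._bbox = None
        self._depth = 0      # span nesting depth inside the current line

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        match = BBOX_RE.search(attrs.get("title") or "")
        if self._words is not None:
            if tag == "span":
                self._depth += 1
        elif "ocr_page" in classes and match:
            self.page_height = int(match.group(4))
        elif tag == "span" and "ocr_line" in classes and match:
            self._bbox = tuple(int(g) for g in match.groups())
            self._words = []
            self._depth = 1

    def handle_endtag(self, tag):
        if self._words is not None and tag == "span":
            self._depth -= 1
            if self._depth == 0:
                self.lines.append((self._bbox, " ".join(self._words)))
                self._words = None

    def handle_data(self, data):
        if self._words is not None and data.strip():
            self._words.append(data.strip())


def footnote_candidates(hocr, bottom_fraction=0.8):
    """Return text of small lines that start in the bottom part of the page."""
    parser = HocrLines()
    parser.feed(hocr)
    if parser.page_height is None or not parser.lines:
        return []
    heights = [bbox[3] - bbox[1] for bbox, _ in parser.lines]
    typical = median(heights)
    cutoff = parser.page_height * bottom_fraction
    return [text for (bbox, text), height in zip(parser.lines, heights)
            if bbox[1] >= cutoff and height < typical]


SAMPLE = """
<div class='ocr_page' id='page_1' title='image "page.png"; bbox 0 0 1000 1400; ppageno 0'>
 <span class='ocr_line' title='bbox 100 200 900 240'><span class='ocrx_word' title='bbox 100 200 300 240'>Body</span> <span class='ocrx_word' title='bbox 310 200 500 240'>text</span></span>
 <span class='ocr_line' title='bbox 100 260 900 300'><span class='ocrx_word' title='bbox 100 260 300 300'>More</span> <span class='ocrx_word' title='bbox 310 260 500 300'>text</span></span>
 <span class='ocr_line' title='bbox 100 1300 700 1324'><span class='ocrx_word' title='bbox 100 1300 120 1324'>1</span> <span class='ocrx_word' title='bbox 130 1300 400 1324'>Footnote</span></span>
</div>
"""

print(footnote_candidates(SAMPLE))
```

Linking a footnote marker in the body to the matching footnote would still
have to be done by hand or with further post-processing.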

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, May 30, 2017 at 7:20 PM, Felipe Ghiardo  wrote:

> Hi all.
>
> Using other OCR engines (ABBYY, for example), the process recognizes the
> footnotes and makes the link. It also recognizes headers and footers. The
> question is how I can do the same with tesseract, at least for the footnotes.
> Is it something that one can train? And how do you do it? Thanks for the help
> (and sorry for my English).
>


Re: [tesseract-ocr] Fine-turning LSTM for Japanese

2017-05-28 Thread ShreeDevi Kumar
Ray is the best person to answer your questions. I can only share my
experience trying to train using Devanagari script.

Fine tuning will work if all you want to change is a font, with the same
unicharset. This works well for Latin-script-based languages but not for
complex scripts.

E.g. for Devanagari, the consonants, vowel marks and combining marks together
make an 'akshara' glyph, and the unicharset in the language model contains
these. If the new training text has additional akshara glyphs, fine-tune
training gives errors such as "Encoding of string failed!".

For Devanagari, I have tried training by changing the top layer. This adds the
new akshara glyphs. However, for good accuracy, training has to continue until
the error rate reaches 0.01%, which takes very long; I have not been able to
reach that level in my training. Also, this may impact accuracy on the
originally trained fonts. Currently, using --eval_listfile with a different
set of images during training does not work.

The dawgs (DAWG: Directed Acyclic Word Graph) are a way of compressing the wordlists.
https://tesseract-ocr.repairfaq.org/allaboutdawg.html
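
To see the compression in practice, a word list can be converted to a DAWG and expanded back with the training tools. A minimal sketch, assuming the wordlist2dawg and dawg2wordlist binaries are built and on the PATH; dawg_roundtrip and the file naming are hypothetical, only for illustration.

```shell
# Round-trip a word list through a DAWG using the Tesseract training tools.
# $1: plain-text word list (one word per line)
# $2: unicharset for the language
# $3: output prefix (hypothetical naming, e.g. /tmp/eng.word)
dawg_roundtrip() {
  wordlist=$1
  unicharset=$2
  prefix=$3
  # Compress the word list into a DAWG
  wordlist2dawg "$wordlist" "$prefix.dawg" "$unicharset" || return 1
  # Expand the DAWG back into plain text to inspect what it stores
  dawg2wordlist "$unicharset" "$prefix.dawg" "$prefix.roundtrip.txt"
}
```

Comparing the size of the .dawg file with the original word list gives an idea of how much the shared prefixes and suffixes save.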

There is no way to finetune the legacy engine.



ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Mon, May 29, 2017 at 9:19 AM, Akira Hayakawa  wrote:

> Thanks for the reply. I understand.
>
> There are a couple of questions related to this topic.
>
> 1)
>
> May training_text include only the text for the next (new) round of learning?
> For example, if the LSTM net has learned the line "I have a pen" and we need
> it to learn the line "I have a pineapple", does training_text include only
> the pineapple line, with the pen line removed?
>
> 2)
>
> In https://github.com/tesseract-ocr/tesseract/wiki/Training-
> Tesseract-%E2%80%93-tesstrain.sh
>
> the files in langdata other than training_text are said to be optional.
> I suppose these files are internally handled as hints. Am I right?
> And what if these files are inconsistent with training_text? For example,
> wordlist may contain fairly irrelevant words.
> Should I erase the optional files if they are inconsistent?
>
> 3)
>
> Closely related to 2).
> When the langdata doesn't have these optional files, does Tesseract
> internally generate them from training_text?
>
> 4)
>
> Is there no way to fine-tune legacy tesseract?
>
> 5)
>
> In https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
>
> There is a note:
>
>> NOTE Tesseract 4.00 will now run happily with a traineddata file that
>> contains just lang.lstm. The lstm-*-dawgs are optional, and none of the
>> other files are required or used with OEM_LSTM_ONLY as the OCR engine mode.
>> No bigrams, unichar ambigs or any of the other files are needed or even have
>> any effect if present.
>
>
> Does this mean that if we use LSTM only (the legacy tesseract is going to be
> removed in a future release, right?), the optional files like wordlist are
> entirely unnecessary? This sounds natural to me because, as far as I
> understand, the LSTM net only learns a text line from a sequence of bytes or
> an image.
> By the way, what does "dawgs" mean?
>
>


Re: [tesseract-ocr] Fine-turning LSTM for Japanese

2017-05-28 Thread ShreeDevi Kumar
Please see inline replies:

On Sun, May 28, 2017 at 4:53 PM, Akira Hayakawa  wrote:

> I am new to tesseract. My aim is to use this software to analyze Japanese
> doc. The idea in my mind is to start from existing model and fine-tune it
> by new words that weren't correctly recognized.
>
> I am reading the Wiki and have some questions.
>
> 1)
>
> In https://github.com/tesseract-ocr/tesseract/wiki/
> TrainingTesseract-4.00---Finetune
>
>  you add training_text to tesstrain.sh
>
> training/tesstrain.sh \
>> --fonts_dir /usr/share/fonts \
>> --training_text ../langdata/ara/ara.training_text \
>> --langdata_dir ../langdata \
>> --tessdata_dir ./tessdata \
>> --lang ara \
>> --linedata_only \
>> --noextract_font_properties \
>> --exposures "0" \
>> --fontlist "Arial" \
>> --output_dir ~/tesstutorial/aratest
>
>
> but
>
> In https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
>
> You don't. Why?
>
> training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng
> --linedata_only \
> --noextract_font_properties --langdata_dir ../langdata \
> --tessdata_dir ./tessdata --output_dir ~/tesstutorial/engtrain
>
> My understanding is
>
> 1. tesstrain.sh uses text2image command internally to generate images
> which are in various fonts and reshaped.
> 2. --linedata_only splits the training text into line and makes images for
> each line.
> 3. langdata_dir is essential but training_text isn't. If training_text
> isn't found, it uses the default $lang/$lang.training_text.
>
> Am I correct?
>

​Yes, you are correct.​

>
> 2)
>
> In the above example, I have no idea why it should take --tessdata_dir,
> because it seems irrelevant to making training data.
>

tesseract needs the eng and osd traineddata during initialization. The
location can also be specified via TESSDATA_PREFIX.
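
A quick way to check a directory before passing it as --tessdata_dir is sketched below. check_tessdata is a hypothetical helper, not something shipped with tesseract; it only tests that the two files mentioned above are present.

```shell
# Verify that a tessdata directory holds the files tesseract needs at startup.
# $1: directory you intend to pass as --tessdata_dir
# check_tessdata is a hypothetical helper, not part of tesseract.
check_tessdata() {
  dir=$1
  missing=0
  for lang in eng osd; do
    if [ ! -f "$dir/$lang.traineddata" ]; then
      echo "missing: $dir/$lang.traineddata" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, running check_tessdata ./tessdata before tesstrain.sh reports which of the two files still has to be downloaded.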

>
> 3)
>
> In https://github.com/tesseract-ocr/tesseract/wiki/
> TrainingTesseract-4.00---Finetune
>
> It says the reader should lay out the projects like this
>
> ./langdata
>> ./langdata/eng
>> ./langdata/ara
>> ./tessdata
>> ./tesseract
>> ./tesseract/tessdata
>> ./tesseract/tessdata/configs/
>> ./tesseract/training
>> etc
>
>
That will be the directory structure if you clone the tesseract,
langdata and tessdata repositories.

It is not recommended to clone the whole tessdata repo (over 1 GB); you can
download the traineddata files for just the languages you need.

>
> and all the following examples are run under the tesseract directory. So I
> think the examples should take ../tessdata as --tessdata_dir rather than
> ./tessdata. I mean the examples should be fixed.
>
>
./tessdata (in the tesseract repo) does not have any traineddata files to
begin with.

You can change the directories to match your directory configuration.



> 4)
>
> In https://github.com/tesseract-ocr/tesseract/wiki/
> TrainingTesseract-4.00---Finetune
>
> combine_tessdata -e ../tessdata/ara.traineddata \
>> ~/tesstutorial/aratuned_from_ara/ara.lstm
>
>
> This is explained as extracting the existing LSTM model for Arabic from
> tessdata, but how does that work?
> Does the combine_tessdata command extract the LSTM model because the
> extension of the second parameter is .lstm?
>

​Yes.​

>
> Another question here is why the LSTM model is mixed into the traineddata. I
> think the traineddata file mixes the legacy trained model and the LSTM model,
> and I am wondering why they aren't separated. Even if the user only uses LSTM,
> are both trained models read? (Is it lightweight? Then it might be OK.)
>

The 4.0 code is in the alpha stage of testing and supports both the legacy
engine and the new LSTM engine, and the traineddata file has both models.

You can use combine_tessdata to keep only the LSTM model in the traineddata.
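
One possible way to do that is sketched here: extract the .lstm component into an otherwise empty directory, then let combine_tessdata pack whatever it finds under that prefix. lstm_only_traineddata and the work directory are hypothetical names. This assumes the lstm component is self-contained, as in the current alpha; later versions may need further lstm-* components extracted the same way.

```shell
# Build <lang>.traineddata containing only the LSTM component.
# $1: source traineddata, $2: language code, $3: empty work directory
# lstm_only_traineddata is a hypothetical helper name.
lstm_only_traineddata() {
  src=$1
  lang=$2
  work=$3
  mkdir -p "$work" || return 1
  # -e extracts the component named by the output file's extension (.lstm)
  combine_tessdata -e "$src" "$work/$lang.lstm" || return 1
  # Given a path prefix, combine_tessdata packs every component file it finds
  combine_tessdata "$work/$lang."
}
```

For example, lstm_only_traineddata ../tessdata/ara.traineddata ara /tmp/ara_lstm should leave /tmp/ara_lstm/ara.traineddata with just the LSTM model.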
​


Re: [tesseract-ocr] Re: Cube training tools

2017-05-26 Thread ShreeDevi Kumar
> Just give us the step followed to train the language eng, hin, ita, etc.,
in the present tessdata repo.

As stated before, this information is not available. The training was done
at Google and details were not shared since it was to be superseded by the
new LSTM engine.

The answer is not going to change if you keep asking :-)

Also see,
https://github.com/tesseract-ocr/tesseract/issues/40#issuecomment-263348132


ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Fri, May 26, 2017 at 3:25 PM, Merlin ArulPrakash <
amelimerlina...@gmail.com> wrote:

>
> Hi Zdenko,
>
> Thanks for the info, but I already have those tessdata files, so I am only
> asking for support in training cube data for the other languages, which
> don't have those cube-related files.
>
> Just give us the step followed to train the language eng, hin, ita, etc.,
> in the present tessdata repo.
>
> Thanks and Regards,
> Merlin
>
> On Wednesday, May 24, 2017 at 6:49:04 PM UTC+5:30, zdenop wrote:
>>
>> Cube data were available only for a few languages. The available data can
>> be found in https://github.com/tesseract-ocr/tessdata/tree/3.04.00
>>
>> Zdenko
>>
>> On Wed, May 24, 2017 at 2:54 PM, ShreeDevi Kumar <shree...@gmail.com>
>> wrote:
>>
>>> cube training is not supported, no information is available for it. It
>>> has been deleted from the latest code.
>>>
>>> ShreeDevi
>>> 
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>
>>> On Wed, May 24, 2017 at 2:51 PM, Merlin ArulPrakash <
>>> amelime...@gmail.com> wrote:
>>>
>>>> Hi ,
>>>>
>>>> Is there any tool for training cube data for tesseract? I need
>>>> trained data for the combined engine mode (TesseractAndCube) for all
>>>> the languages in tessdata. If anyone already has the cube data files,
>>>> kindly share them with me, or share the tool or procedure to
>>>> get the Cube trained data for languages other than English.
>>>>
>>>>
>>>> Thanks in Advance,
>>>> Merlin
>>>>
>>>> On Friday, December 5, 2014 at 1:33:02 PM UTC+5:30, Emil Julius wrote:
>>>>>
>>>>> Hey, I'm currently planning on writing some training tools for the
>>>>> Cube engine. But I would like to be sure that I'm not reinventing the
>>>>> wheel, as the only documentation I was able to find was:
>>>>> https://code.google.com/p/tesseract-ocr-extradocs/wiki/Cube
>>>>> Which, I believe is written by one of the guys in this google group?
>>>>> I'm currently prioritizing tools for:
>>>>> * cube.size (one of the 2 bigram files)
>>>>> * cube.bigrams
>>>>>
>>>>> The tool for cube.bigrams is gonna be designed to take a plain text
>>>>> input file, and then calculate the bigrams and their frequency, then 
>>>>> output
>>>>> in the according file format
>>>>>
>>>>> I'm still trying to figure out a smart way to train the cube.size
>>>>> files, help is very welcome ;-).
>>>>>
>>>>> Also, what's the current state of the Tesseract project in general?
>>>>>
>>>>> Sincerly
>>>>>

Re: [tesseract-ocr] How to extend the output format

2017-05-25 Thread ShreeDevi Kumar
tesseract writes the file names to the console (stderr); to capture them alongside the recognized text, you can try the following:

tesseract list.txt  stdout  > output.txt 2>&1

or

 tesseract list.txt  stdout -c include_page_breaks=1 > output.txt 2>&1
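
If you need a clearer marker per image than the console messages, another option is to run tesseract once per file and print the path yourself. A minimal sketch: ocr_with_names is a hypothetical helper, and its optional second argument (an alternative OCR command) is my own addition for illustration.

```shell
# Print a header with the image path before each OCR result.
# $1: text file with one image path per line
# $2: OCR command to run (hypothetical override; defaults to tesseract)
ocr_with_names() {
  list_file=$1
  ocr_cmd=${2:-tesseract}
  while IFS= read -r image || [ -n "$image" ]; do
    [ -z "$image" ] && continue            # skip blank lines
    printf '=== %s ===\n' "$image"
    # stdin is redirected so the OCR command cannot consume the list
    "$ocr_cmd" "$image" stdout 2>/dev/null </dev/null
  done < "$list_file"
}

# Usage: ocr_with_names list.txt > output.txt
```

This is slower than a single run over list.txt because the engine is initialized for every image, but each result is unambiguously tied to its file.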




ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Thu, May 25, 2017 at 4:10 PM, Matan  wrote:

> Hello,
> I'm using tesseract to process multiple images in one run.
> The problem is that the output provides me only with the result strings;
> I can't connect each result to its original image.
>
> Is there a way to extend the results to provide more details such as,
> picture file name, path, etc..
>
> thanks!
>


Re: [tesseract-ocr] Re: Cube training tools

2017-05-24 Thread ShreeDevi Kumar
cube training is not supported, no information is available for it. It has
been deleted from the latest code.

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Wed, May 24, 2017 at 2:51 PM, Merlin ArulPrakash <
amelimerlina...@gmail.com> wrote:

> Hi ,
>
> Is there any tool for training cube data for tesseract? I need
> trained data for the combined engine mode (TesseractAndCube) for all
> the languages in tessdata. If anyone already has the cube data files,
> kindly share them with me, or share the tool or procedure to
> get the Cube trained data for languages other than English.
>
>
> Thanks in Advance,
> Merlin
>
> On Friday, December 5, 2014 at 1:33:02 PM UTC+5:30, Emil Julius wrote:
>>
>> Hey, I'm currently planning on writing some training tools for the Cube
>> engine. But I would like to be sure that I'm not reinventing the wheel, as
>> the only documentation I was able to find was: https://code.google.com/p
>> /tesseract-ocr-extradocs/wiki/Cube
>> Which, I believe is written by one of the guys in this google group?
>> I'm currently prioritizing tools for:
>> * cube.size (one of the 2 bigram files)
>> * cube.bigrams
>>
>> The tool for cube.bigrams is gonna be designed to take a plain text input
>> file, and then calculate the bigrams and their frequency, then output in
>> the according file format
>>
>> I'm still trying to figure out a smart way to train the cube.size files,
>> help is very welcome ;-).
>>
>> Also, what's the current state of the Tesseract project in general?
>>
>> Sincerly
>>


Re: [tesseract-ocr] How to add space between strings of document. Punjabi (Gurmukhi) language have a space issue, after ocr the image it is showing no space b/w the text.

2017-05-24 Thread ShreeDevi Kumar
Which O/S?
Which version of Tesseract?
How are you training?

Have you tried the packaged traineddata for Punjabi? What result do you get
with that?

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Wed, May 24, 2017 at 10:14 AM, Mandeep Singh 
wrote:

> Hello Guys,
>
> I am training data for the Punjabi language and I am getting a spacing
> issue. How do I edit the config file, and how do I make my own personal
> config file for my own custom language? Please assist me.
>
>
> Output is : ੳਸਦਡਗ
> i want and i assume output like this => ੳ ਸ ਦ ਡ ਗ
>


Re: [tesseract-ocr] Neural networks in tesseract 4.0

2017-05-22 Thread ShreeDevi Kumar
https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LSTM

https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Mon, May 22, 2017 at 8:31 PM, Thilina Jayathilaka <
dgtkjayathil...@gmail.com> wrote:

> The latest version of tesseract contains a customizable neural network, as
> mentioned in its documentation. What is the actual purpose of this? Does
> it enhance the character recognition?
>
> Are there any documented guidelines on how to train/create the neural
> network for tesseract 4.0?
>


Re: [tesseract-ocr] Generating a PDF with Tesseract C++

2017-05-22 Thread ShreeDevi Kumar
Look at the examples in

https://github.com/tesseract-ocr/docs/blob/master/das_tutorial2016/2ArchitectureAndDataStructures.pdf

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Mon, May 22, 2017 at 7:34 PM, Saliaj Adrian  wrote:

> Thank you but it still doesn't work...
>
> Can someone just tell me what should be added to this code to generate a
> PDF output?
>
> #include <tesseract/baseapi.h>
> #include <leptonica/allheaders.h>
>
> int main()
> {
>     fprintf(stderr, "Heyhey !\n");
>     char *outText;
>
>     tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
>
>     if (api->Init(NULL, "fra")) {
>         fprintf(stderr, "Could not initialize tesseract.\n");
>         exit(1);
>     }
>
>     Pix *image = pixRead("/path/test.tif");
>     api->SetImage(image);
>
>     outText = api->GetUTF8Text();
>     printf("OCR output:\n%s", outText);
>
>     api->End();
>     delete [] outText;
>     pixDestroy(&image);
>
>     return 0;
> }
>
>
>
> Thanks a lot
>


Re: [tesseract-ocr] Training from scratch

2017-05-20 Thread ShreeDevi Kumar
also see
https://github.com/tesseract-ocr/tesseract/blob/master/contrib/genlangdata.pl

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sat, May 20, 2017 at 10:12 AM, ShreeDevi Kumar <shreesh...@gmail.com>
wrote:

> Google has not shared its method of training with complete scripts etc.
> The training instructions on wiki are only a tutorial for learning about
> LSTM training.
>
> Please also see https://github.com/tesseract-ocr/tesseract/issues/644
>
>
> ShreeDevi
>



Re: [tesseract-ocr] Training from scratch

2017-05-19 Thread ShreeDevi Kumar
Google has not shared its method of training with complete scripts etc. The
training instructions on wiki are only a tutorial for learning about LSTM
training.

Please also see https://github.com/tesseract-ocr/tesseract/issues/644


ShreeDevi



Re: [tesseract-ocr] Training from scratch

2017-05-19 Thread ShreeDevi Kumar
As per Ray, 4500 fonts and 40 lines of text were used to create the
models for Latin-script-based languages. So I am not sure whether you can
replicate the model.

For language specific exposure settings etc see

https://github.com/tesseract-ocr/tesseract/blob/master/training/language-specific.sh

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Fri, May 19, 2017 at 8:22 AM,  wrote:

> Suppose I am training tesseract 4 from scratch, English for example. I know
> I need to have the proper fonts installed, but what other parameters would be
> needed to produce the same model for English? I.e. what exposure settings
> were used to degrade images, etc.?
>


Re: [tesseract-ocr] Tesseract 4 new Font

2017-05-17 Thread ShreeDevi Kumar
1. Which --oem are you using with tesseract 4, legacy engine or lstm?

--oem 0 or --oem 1

2. Is Brazilian Portuguese very different from Portuguese? Please see the
trainingtext and wordlists on
https://github.com/tesseract-ocr/langdata/tree/master/por

3. Provide a sample image with its ground truth and point out the errors
in it. Is the image at 300 dpi?

4. Please share the box/tiff pair to test for training.
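
Regarding question 1, a quick way to see which engine does better on your font is to run the same image through both and compare. This is only a sketch: compare_engines and the output naming are hypothetical, and --oem 0 assumes the por traineddata you have installed still contains the legacy model.

```shell
# Run the same image through both engines and show where the text differs.
# $1: image file, $2: language code, $3: output prefix (hypothetical naming)
compare_engines() {
  image=$1
  lang=$2
  prefix=$3
  tesseract "$image" "$prefix.legacy" -l "$lang" --oem 0 || return 1
  tesseract "$image" "$prefix.lstm" -l "$lang" --oem 1 || return 1
  # tesseract appends .txt to the output base name
  diff "$prefix.legacy.txt" "$prefix.lstm.txt"
}
```

For example, compare_engines sample.png por /tmp/sample prints the differing lines; diff exits with 0 when both engines produce identical text.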

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Wed, May 17, 2017 at 2:49 AM, Maicon Azevedo 
wrote:

> Hello!
>
> Guys I have tesseract 4 on Ubuntu 16.04.
>
> Running tesseract with -l por (Portuguese from Brazil), I don't get
> good results. The image uses a different font than the trained data (I think).
>
> My question is: is it necessary to train tesseract again? I created the tif
> and box files with jtesseditor, but I don't know what I need to do with these
> files or how to write good training data. I saw
> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
> but I didn't find any case similar to mine.
> Thanks in advance!
>


Re: [tesseract-ocr] include tesseract ocr in visual studio 2010

2017-05-15 Thread ShreeDevi Kumar
Which version of Tesseract are you using, and from which source?

The Tesseract 4 master branch does not support Visual Studio 2010; please
check the changelog.

You can try the 3.05 branch or a newer Visual Studio.

On May 15, 2017 8:10 PM, "emna ouerteni"  wrote:

> include tesseract ocr in visual studio 2010
>
> I tried to add Tesseract OCR to Visual Studio 2010. The build succeeded,
> but when I run the program there is an error 0xc0150002.
>
> I tried to find the missing DLL with Dependency Walker, which shows:
>
> Error: The Side-by-Side configuration information for
> "c:\users\documents\visual studio 2010\projects\tess_open\debug\LIBTESSERACT302D.DLL"
> contains errors. The application has failed to start because its
> side-by-side configuration is incorrect. Please see the application event
> log or use the command-line sxstrace.exe tool for more detail (14001).
>
> Error: The Side-by-Side configuration information for
> "c:\users\documents\visual studio 2010\projects\tess_open\debug\LIBLEPT168D.DLL"
> contains errors. The application has failed to start because its
> side-by-side configuration is incorrect. Please see the application event
> log or use the command-line sxstrace.exe tool for more detail (14001).
>
> Error: Modules with different CPU types were found.
>
> Warning: At least one module has an unresolved import due to a missing
> export function in a delay-load dependent module.


Re: [tesseract-ocr] Tesseract 4: Shuffling training instances and unicharset compression at the same time?

2017-05-12 Thread ShreeDevi Kumar
Please see
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00

80 is the default. I think it means both 64 and 16 are applied.


train_mode (int, default 80): Flags from TrainingFlags in lstmrecognizer.h.
Possible values: 64 for compress unicharset, 16 for round-robin training.
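
To see why 80 would mean that both options are on, it may help to read
train_mode as a bit mask: 64 + 16 = 80, and the two values share no bits.
The sketch below assumes only the two values quoted above. The constant and
function names are made up for illustration and are not Tesseract
identifiers.

```python
# Values quoted from the wiki text above; the names are hypothetical.
ROUND_ROBIN = 16
COMPRESS_UNICHARSET = 64

FLAG_NAMES = {
    ROUND_ROBIN: "round-robin training",
    COMPRESS_UNICHARSET: "compress unicharset",
}


def decode_train_mode(value):
    """Return the names of the known flags whose bit is set in `value`."""
    return [name for bit, name in sorted(FLAG_NAMES.items()) if value & bit]


if __name__ == "__main__":
    for mode in (16, 64, 80):
        print(mode, decode_train_mode(mode))
```

If that reading is right, the default of 80 should already request
compression and round-robin training together.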

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Fri, May 12, 2017 at 1:46 PM, 'kolomiyets' via tesseract-ocr <
tesseract-ocr@googlegroups.com> wrote:

> Hi,
>
> I noticed that when training with unicharset compression (train_mode),
> training instances are used sequentially from one lstmf training file. This
> causes local model convergence (for the current training font), whereas
> other fonts (training instances) are not used for training at all.
> Shuffling is possible with train_mode 16, but in my case I have to use
> unicharset compression, which is train_mode 64.
>
>
> Is it possible to use unicharset compression and to shuffle training
> instances at the same time?
>
>
> Many thanks in advance.
>
> Cheers,
> Alex
>
>


Re: [tesseract-ocr] Training Tesseract 4.0 for Nepali Language

2017-05-10 Thread ShreeDevi Kumar
Make a collection of Unicode Devanagari fonts (look at fonts.google.com).

Make a large training text from Nepali text.

Review and improve the wordlist in tesseract-ocr/langdata for Nepali.

I will share my modified training scripts, which use small sections of the
large training text for each font.
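
The modified scripts themselves are not included in this thread. Purely as
an illustration of the idea of giving each font its own small section of a
large training text, here is a sketch in Python. The function name, the
contiguous-slice strategy and the font names in the example are assumptions,
not the actual scripts.

```python
def sections_per_font(lines, fonts, lines_per_font):
    """Assign each font a contiguous slice of the training text.

    Wraps around to the start of the text when there are more fonts than
    full sections available.
    """
    if not lines or lines_per_font <= 0:
        raise ValueError("need a non-empty text and a positive section size")
    n_sections = max(1, len(lines) // lines_per_font)
    result = {}
    for i, font in enumerate(fonts):
        start = (i % n_sections) * lines_per_font
        result[font] = lines[start:start + lines_per_font]
    return result


if __name__ == "__main__":
    text = ["line %d" % i for i in range(6)]
    # Hypothetical font names, used only for the demonstration.
    for font, part in sections_per_font(text, ["Font A", "Font B"], 2).items():
        print(font, part)
```

Each slice could then be written to its own file and rendered with a single
font, so that no one font sees the whole text.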

Please note that so far my experiments have not succeeded in improving the
accuracy of the Hindi traineddata.

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

2017-05-10 22:07 GMT+05:30 ShreeDevi Kumar <shreesh...@gmail.com>:

> see
>
> https://github.com/tesseract-ocr/langdata/tree/master/nep
>
> http://crubadan.org/languages/ne
>
> https://ne.wikipedia.org/wiki/%E0%A4%AE%E0%A5%81%E0%A4%96%
> E0%A5%8D%E0%A4%AF_%E0%A4%AA%E0%A5%83%E0%A4%B7%E0%A5%8D%E0%A4%A0
>
> ShreeDevi
> 
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> 2017-05-10 20:51 GMT+05:30 Nirajan Pant <niraja...@gmail.com>:
>
>> Thank you @shree. Can you help in how to generate langdata for training
>> Tesseract 4.0?
>>
>> On Wednesday, 10 May 2017 17:25:56 UTC+5:45, shree wrote:
>>>
>>> Please open an issue in the langdata repo with any specific errors that
>>> you see for Nepali. Take a look at the wordlist and training_text.
>>>
>>> ShreeDevi
>>> 
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>
>>> 2017-05-10 17:03 GMT+05:30 Nirajan Pant <nira...@gmail.com>:
>>>
>>>> Yeah! I got the same result as yours with hin.traineddata, which is
>>>> better than nep.traineddata. I think the langdata needs some revisions.
>>>> I have attached the ground truth text for the image.
>>>>
>>>>
>>>>
>>>> On Tuesday, 9 May 2017 22:38:25 UTC+5:45, shree wrote:
>>>>>
>>>>> Attached is the output I get with
>>>>>
>>>>> tesseract nep_text_11.png nep_text_11 --oem 1 --psm 6 -l hin
>>>>>
>>>>>
>>>>> ShreeDevi
>>>>> 
>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>>
>>>>> 2017-05-09 21:11 GMT+05:30 ShreeDevi Kumar <shree...@gmail.com>:
>>>>>
>>>>>> Thanks. Please provide the 'ground truth' ie the original accurate
>>>>>> text for the image.
>>>>>>
>>>>>> Have tried to OCR the same image with options
>>>>>>
>>>>>> --oem 1 --psm 6 -l hin
>>>>>>
>>>>>> Sometimes hindi traineddata gives better results.
>>>>>>
>>>>>> On May 9, 2017 9:05 PM, "Nirajan Pant" <nira...@gmail.com> wrote:
>>>>>>
>>>>>>> Here is a sample image:
>>>>>>>
>>>>>>>
>>>>>>> <https://lh3.googleusercontent.com/-4WrfbKY7lFk/WRHhTrz5F-I/DOU/drzKr-Gl1E4MHjhCErwiH_BnYe1CPk8XQCLcB/s1600/nep_text_11.png>
>>>>>>>
>>>>>>> And the result is:
>>>>>>>
>>>>>>> त्यसपछि कसरी उ इजरायल प्रवेश गर्यो,, घर बनायो ? जागीर खायो? उफ~~ सबै
>>>>>>> बिर्सिइयो , आफ्नै जीवनकथा सिलसिला मिलाएर
>>>>>>> सम्हानसकोक्षमत्तापत्तिफ्लिअबउसमा|स्वारणशक्तिक्षीणहुदेंगएकोछ्
>>>>>>> दुकौंपन्निदुत्सकोस्पष्टहेक्कारह्दैन।
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> मन्दिर जाने बाटो र प्रार्थनाका एक दुइ ऋचा मन्त्रहरु बाहेक उसको
>>>>>>> सम्झनामा सबै कुरा अधुरा छन । दिनभरिको अधिकांश समय यिनै
>>>>>>> कुरामा सिमित गर्दै आएको यो बुढो मान्छे संग कति खुसिका क्षणहरु होलान,
>>>>>>> कति संघर्ष वा दुखका कहानीहरु होलान ? म बारम्बार
>>>>>>> सोध्ने यत्न गर्छु, उ मुस्काई मात्र रहन्छ ।
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> आज त त्यो मुस्कान पनि उसले बिर्से जस्तो छ, घरिघरि एक्लै बर्बराएको
>>>>>>> सुन्छु " हे भगवान, कति एक्लो जीवन !"
>>>>>>>
>>>>>>>
>>>>>>> एक कप तातो कफी पिई सकेपछि बल्ल् अने मुखबाट उठेको बाफ पर पर फ्याक्दै
>>>>>>> उ प्रश्न 

Re: [tesseract-ocr] Training Tesseract 4.0 for Nepali Language

2017-05-10 Thread ShreeDevi Kumar
Please open an issue in the langdata repo with any specific errors that you
see for Nepali. Take a look at the wordlist and training_text.

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

2017-05-10 17:03 GMT+05:30 Nirajan Pant <niraja...@gmail.com>:

> Yeah! I got the same result as yours with hin.traineddata, which is better
> than nep.traineddata. I think the langdata needs some revisions. I have
> attached the ground truth text for the image.
>
>
>
> On Tuesday, 9 May 2017 22:38:25 UTC+5:45, shree wrote:
>>
>> Attached is the output I get with
>>
>> tesseract nep_text_11.png nep_text_11 --oem 1 --psm 6 -l hin
>>
>>
>> ShreeDevi
>> 
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> 2017-05-09 21:11 GMT+05:30 ShreeDevi Kumar <shree...@gmail.com>:
>>
>>> Thanks. Please provide the 'ground truth' ie the original accurate text
>>> for the image.
>>>
>>> Have tried to OCR the same image with options
>>>
>>> --oem 1 --psm 6 -l hin
>>>
>>> Sometimes hindi traineddata gives better results.
>>>
>>> On May 9, 2017 9:05 PM, "Nirajan Pant" <nira...@gmail.com> wrote:
>>>
>>>> Here is a sample image:
>>>>
>>>>
>>>> <https://lh3.googleusercontent.com/-4WrfbKY7lFk/WRHhTrz5F-I/DOU/drzKr-Gl1E4MHjhCErwiH_BnYe1CPk8XQCLcB/s1600/nep_text_11.png>
>>>>
>>>> And the result is:
>>>>
>>>> त्यसपछि कसरी उ इजरायल प्रवेश गर्यो,, घर बनायो ? जागीर खायो? उफ~~ सबै
>>>> बिर्सिइयो , आफ्नै जीवनकथा सिलसिला मिलाएर
>>>> सम्हानसकोक्षमत्तापत्तिफ्लिअबउसमा|स्वारणशक्तिक्षीणहुदेंगएकोछ्
>>>> दुकौंपन्निदुत्सकोस्पष्टहेक्कारह्दैन।
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> मन्दिर जाने बाटो र प्रार्थनाका एक दुइ ऋचा मन्त्रहरु बाहेक उसको सम्झनामा
>>>> सबै कुरा अधुरा छन । दिनभरिको अधिकांश समय यिनै
>>>> कुरामा सिमित गर्दै आएको यो बुढो मान्छे संग कति खुसिका क्षणहरु होलान,
>>>> कति संघर्ष वा दुखका कहानीहरु होलान ? म बारम्बार
>>>> सोध्ने यत्न गर्छु, उ मुस्काई मात्र रहन्छ ।
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> आज त त्यो मुस्कान पनि उसले बिर्से जस्तो छ, घरिघरि एक्लै बर्बराएको
>>>> सुन्छु " हे भगवान, कति एक्लो जीवन !"
>>>>
>>>>
>>>> एक कप तातो कफी पिई सकेपछि बल्ल् अने मुखबाट उठेको बाफ पर पर फ्याक्दै उ
>>>> प्रश्न गर्छ -
>>>> 'म्झिचकोबिषयमाकतिलेखिइन्यग्यौत?पुस्तककहिलेतय1रहुनात्तिम्रो?"
>>>> किबुच एक प्रकारको सामुदायिक विकासको अवधारणा हो, इजरायलमा यसको उदाहरणीय
>>>> र अनुकरणीय प्रयोग भएको छ |
>>>>
>>>>
>>>> "अहँ आधा पनि सकेको छैन, यस्ता खाले पुस्तकको हाम्रो देशमा खासै महत्व या
>>>> उपयोगिता होला जस्तो पनि लाग्दैन । त्यसैले यी
>>>>
>>>>
>>>> अहिले त कथा पो लेखन थालेको छु, फेसबुकतिर टाँस्दिन्छु , एक दुइ जनाले
>>>> पढ्छन पनि।"
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> मेरो नजीक आएर अन्छ उ, त्यसो भए आज के लेख्यौँ त, सुनाउन त ?
>>>>
>>>>
>>>>
>>>>
>>>> On Tuesday, 9 May 2017 12:54:31 UTC+5:45, shree wrote:
>>>>>
>>>>> Please provide sample of 'not giving good results' and samples of
>>>>> lines not being recognized correctly. Images and ground truth files
>>>>> will be helpful.
>>>>>
>>>>> ShreeDevi
>>>>> 
>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>>
>>>>> On Tue, May 9, 2017 at 12:16 PM, Nirajan Pant <nira...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> The trained data provided here
>>>>>> <https://github.com/tesseract-ocr/tessdata> is not giving good
>>>>>> results with Nepali text image documents. It is unable to recognize
>>>>>> some lines correctly. Can anybody help me re-train Tesseract 4.0 for
>>>>>> the Nepali language?
>>>>>>

Re: [tesseract-ocr] How to append eng.traindata with new font. ?

2017-05-09 Thread ShreeDevi Kumar
try option for multiple languages

-l eng+
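
The name after the `+` is cut off in the archived message above. As a sketch
of how a multi-language `-l` argument is put together, the helper below
joins language codes with `+` and builds an argument list for the tesseract
command line. The function name is made up for illustration, and `myfont`
stands for a hypothetical traineddata file created for the custom font.

```python
def tesseract_command(image, output_base, languages, oem=None, psm=None):
    """Build the argument list for a tesseract CLI call.

    `languages` is a list of traineddata names; they are joined with '+'
    as the -l option expects for multiple languages.
    """
    if not languages:
        raise ValueError("at least one language is required")
    cmd = ["tesseract", image, output_base, "-l", "+".join(languages)]
    if oem is not None:
        cmd += ["--oem", str(oem)]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    return cmd


if __name__ == "__main__":
    # 'myfont' is a placeholder, not a traineddata file that ships with Tesseract.
    print(" ".join(tesseract_command("page.png", "page", ["eng", "myfont"])))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`,
assuming tesseract is installed and both traineddata files are in the
tessdata directory. Note that this combines two models at recognition time;
it does not merge the new font into eng.traineddata.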

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, May 9, 2017 at 9:47 PM,  wrote:

> Hi Community,
>
> Can someone please tell me how to append my custom font and training data
> to existing eng.traineddata file ?
>
> Thanks,
> Dinesh
>


Re: [tesseract-ocr] Training Tesseract 4.0 for Nepali Language

2017-05-09 Thread ShreeDevi Kumar
Attached is the output I get with

tesseract nep_text_11.png nep_text_11 --oem 1 --psm 6 -l hin


ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

2017-05-09 21:11 GMT+05:30 ShreeDevi Kumar <shreesh...@gmail.com>:

> Thanks. Please provide the 'ground truth' ie the original accurate text
> for the image.
>
> Have tried to OCR the same image with options
>
> --oem 1 --psm 6 -l hin
>
> Sometimes hindi traineddata gives better results.
>
> On May 9, 2017 9:05 PM, "Nirajan Pant" <niraja...@gmail.com> wrote:
>
>> Here is a sample image:
>>
>>
>> <https://lh3.googleusercontent.com/-4WrfbKY7lFk/WRHhTrz5F-I/DOU/drzKr-Gl1E4MHjhCErwiH_BnYe1CPk8XQCLcB/s1600/nep_text_11.png>
>>
>> And the result is:
>>
>> त्यसपछि कसरी उ इजरायल प्रवेश गर्यो,, घर बनायो ? जागीर खायो? उफ~~ सबै
>> बिर्सिइयो , आफ्नै जीवनकथा सिलसिला मिलाएर
>> सम्हानसकोक्षमत्तापत्तिफ्लिअबउसमा|स्वारणशक्तिक्षीणहुदेंगएकोछ्
>> दुकौंपन्निदुत्सकोस्पष्टहेक्कारह्दैन।
>>
>>
>>
>>
>>
>> मन्दिर जाने बाटो र प्रार्थनाका एक दुइ ऋचा मन्त्रहरु बाहेक उसको सम्झनामा
>> सबै कुरा अधुरा छन । दिनभरिको अधिकांश समय यिनै
>> कुरामा सिमित गर्दै आएको यो बुढो मान्छे संग कति खुसिका क्षणहरु होलान, कति
>> संघर्ष वा दुखका कहानीहरु होलान ? म बारम्बार
>> सोध्ने यत्न गर्छु, उ मुस्काई मात्र रहन्छ ।
>>
>>
>>
>>
>>
>> आज त त्यो मुस्कान पनि उसले बिर्से जस्तो छ, घरिघरि एक्लै बर्बराएको सुन्छु "
>> हे भगवान, कति एक्लो जीवन !"
>>
>>
>> एक कप तातो कफी पिई सकेपछि बल्ल् अने मुखबाट उठेको बाफ पर पर फ्याक्दै उ
>> प्रश्न गर्छ -
>> 'म्झिचकोबिषयमाकतिलेखिइन्यग्यौत?पुस्तककहिलेतय1रहुनात्तिम्रो?"
>> किबुच एक प्रकारको सामुदायिक विकासको अवधारणा हो, इजरायलमा यसको उदाहरणीय र
>> अनुकरणीय प्रयोग भएको छ |
>>
>>
>> "अहँ आधा पनि सकेको छैन, यस्ता खाले पुस्तकको हाम्रो देशमा खासै महत्व या
>> उपयोगिता होला जस्तो पनि लाग्दैन । त्यसैले यी
>>
>>
>> अहिले त कथा पो लेखन थालेको छु, फेसबुकतिर टाँस्दिन्छु , एक दुइ जनाले पढ्छन
>> पनि।"
>>
>>
>>
>>
>>
>>
>>
>>
>> मेरो नजीक आएर अन्छ उ, त्यसो भए आज के लेख्यौँ त, सुनाउन त ?
>>
>>
>>
>>
>> On Tuesday, 9 May 2017 12:54:31 UTC+5:45, shree wrote:
>>>
>>> Please provide sample of 'not giving good results' and samples of lines
>>> not being recognized correctly. Images and ground truth files will be
>>> helpful.
>>>
>>> ShreeDevi
>>> 
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>
>>> On Tue, May 9, 2017 at 12:16 PM, Nirajan Pant <nira...@gmail.com> wrote:
>>>
>>>> The trained data provided here
>>>> <https://github.com/tesseract-ocr/tessdata> is not giving good results
>>>> with Nepali text image documents. It is unable to recognize some lines
>>>> correctly. Can anybody help me re-train Tesseract 4.0 for the Nepali
>>>> language?
>>>>

Re: [tesseract-ocr] Training Tesseract 4.0 for Nepali Language

2017-05-09 Thread ShreeDevi Kumar
Thanks. Please provide the 'ground truth', i.e. the original accurate text for
the image.

Have tried to OCR the same image with options

--oem 1 --psm 6 -l hin

Sometimes hindi traineddata gives better results.
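
Once ground truth is available, "better results" can be put into a number.
One common measure is the character error rate: the edit distance between
the OCR output and the ground truth, divided by the length of the ground
truth. The sketch below is a plain Python implementation; the function names
are chosen here for illustration and are not part of Tesseract. It counts
Unicode code points, so a Devanagari vowel sign or virama counts as its own
character.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(ocr_text, ground_truth):
    """Edit distance divided by the length of the ground truth."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ocr_text, ground_truth) / len(ground_truth)
```

Running the same image with `-l nep` and `-l hin` and comparing the two
rates against the same ground truth file would show which traineddata does
better on that page.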

On May 9, 2017 9:05 PM, "Nirajan Pant"  wrote:

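> Here is a sample image:
>
>
> <https://lh3.googleusercontent.com/-4WrfbKY7lFk/WRHhTrz5F-I/DOU/drzKr-Gl1E4MHjhCErwiH_BnYe1CPk8XQCLcB/s1600/nep_text_11.png>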
>
> And the result is:
>
> त्यसपछि कसरी उ इजरायल प्रवेश गर्यो,, घर बनायो ? जागीर खायो? उफ~~ सबै
> बिर्सिइयो , आफ्नै जीवनकथा सिलसिला मिलाएर
> सम्हानसकोक्षमत्तापत्तिफ्लिअबउसमा|स्वारणशक्तिक्षीणहुदें
> गएकोछ्दुकौंपन्निदुत्सकोस्पष्टहेक्कारह्दैन।
>
>
>
>
>
> मन्दिर जाने बाटो र प्रार्थनाका एक दुइ ऋचा मन्त्रहरु बाहेक उसको सम्झनामा
> सबै कुरा अधुरा छन । दिनभरिको अधिकांश समय यिनै
> कुरामा सिमित गर्दै आएको यो बुढो मान्छे संग कति खुसिका क्षणहरु होलान, कति
> संघर्ष वा दुखका कहानीहरु होलान ? म बारम्बार
> सोध्ने यत्न गर्छु, उ मुस्काई मात्र रहन्छ ।
>
>
>
>
>
> आज त त्यो मुस्कान पनि उसले बिर्से जस्तो छ, घरिघरि एक्लै बर्बराएको सुन्छु "
> हे भगवान, कति एक्लो जीवन !"
>
>
> एक कप तातो कफी पिई सकेपछि बल्ल् अने मुखबाट उठेको बाफ पर पर फ्याक्दै उ
> प्रश्न गर्छ -
> 'म्झिचकोबिषयमाकतिलेखिइन्यग्यौत?पुस्तककहिलेतय1रहुनात्तिम्रो?"
> किबुच एक प्रकारको सामुदायिक विकासको अवधारणा हो, इजरायलमा यसको उदाहरणीय र
> अनुकरणीय प्रयोग भएको छ |
>
>
> "अहँ आधा पनि सकेको छैन, यस्ता खाले पुस्तकको हाम्रो देशमा खासै महत्व या
> उपयोगिता होला जस्तो पनि लाग्दैन । त्यसैले यी
>
>
> अहिले त कथा पो लेखन थालेको छु, फेसबुकतिर टाँस्दिन्छु , एक दुइ जनाले पढ्छन
> पनि।"
>
>
>
>
>
>
>
>
> मेरो नजीक आएर अन्छ उ, त्यसो भए आज के लेख्यौँ त, सुनाउन त ?
>
>
>
>
> On Tuesday, 9 May 2017 12:54:31 UTC+5:45, shree wrote:
>>
>> Please provide sample of 'not giving good results' and samples of lines
>> not being recognized correctly. Images and ground truth files will be
>> helpful.
>>
>> ShreeDevi
>> 
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Tue, May 9, 2017 at 12:16 PM, Nirajan Pant  wrote:
>>
>>> The trained data provided here
>>> <https://github.com/tesseract-ocr/tessdata> is not giving good results
>>> with Nepali text image documents. It is unable to recognize some lines
>>> correctly. Can anybody help me re-train Tesseract 4.0 for the Nepali
>>> language?
>>>


Re: [tesseract-ocr] Re: Tesseract 4.0 Neural Network

2017-05-09 Thread ShreeDevi Kumar
https://github.com/tesseract-ocr/tesseract/wiki/Compiling

master branch on github is for 4.0.0alpha

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, May 9, 2017 at 7:35 PM, sfo  wrote:

> Hello, can you please help me install Tesseract 4.0 from source?
>
> On Tuesday, 18 April 2017 at 10:22:34 UTC+2, Kranthi Kiran wrote:
>>
>> Hello community,
>> I would like to know how to use the neural network feature to recognise
>> my text.
>> I read Tesseract uses Adaptive Classifier by default. How do I enable
>> Neural networks?
>> I have installed Tesseract from source.
>>
>> Also, is there any comparative study I can find showing improved
>> performance of
>> Tesseract 4.0's neural network over the previous approaches.
>>
>> Thank you,
>> Kranthi Kiran GV,
>> 3rd yr CS undergrad,
>> NIT Warangal
>>


Re: [tesseract-ocr] tesseract 4.0 documentation

2017-05-09 Thread ShreeDevi Kumar
https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LSTM

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, May 9, 2017 at 7:29 PM, sfo  wrote:

> Hello! Where can I find the Tesseract 4.0 alpha documentation?
>


Re: [tesseract-ocr] How to automatically generate .box files when using tesstrain.sh?

2017-05-09 Thread ShreeDevi Kumar
Box files are generated after the tif. The script works on 8 fonts at a
time.

ls -l /tmp/tmp.Vu25eURnxk/eng/*.*

will show you all generated files.
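
As an alternative to reading the `ls` output by eye, a short script can list
the rendered .tif files that do not yet have a matching .box file. This is a
sketch only: the function name is made up, and the temporary directory name
is taken from the log in the question and will differ on every run.

```python
from pathlib import Path


def missing_box_files(directory):
    """Return names of .tif files in `directory` with no matching .box file."""
    directory = Path(directory)
    return sorted(
        tif.name
        for tif in directory.glob("*.tif")
        # eng.Font.exp0.tif -> eng.Font.exp0.box
        if not tif.with_suffix(".box").exists()
    )


if __name__ == "__main__":
    # Directory from the log quoted below; replace with your own run's path.
    for name in missing_box_files("/tmp/tmp.Vu25eURnxk/eng"):
        print("no box file yet for", name)
```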




ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, May 9, 2017 at 1:25 PM, Sraw Sraw  wrote:

> When using tesstrain.sh, it automatically generates .tif image files but
> not .box files. What is the use of that? Do I still need to manually
> generate a .box file for every .tif?
>
>
> Rendered page 0 to file /tmp/tmp.Vu25eURnxk/eng/eng.Century_Schoolbook_L_Medium.exp0.tif
> Rendered page 0 to file /tmp/tmp.Vu25eURnxk/eng/eng.URW_Bookman_L_Bold.exp0.tif
> Rendered page 0 to file /tmp/tmp.Vu25eURnxk/eng/eng.URW_Bookman_L_Italic.exp0.tif
> Rendered page 0 to file /tmp/tmp.Vu25eURnxk/eng/eng.URW_Bookman_L_Bold_Italic.exp0.tif
> Rendered page 0 to file /tmp/tmp.Vu25eURnxk/eng/eng.Century_Schoolbook_L_Bold.exp0.tif
> Rendered page 0 to file /tmp/tmp.Vu25eURnxk/eng/eng.DejaVu_Sans_Ultra-Light.exp0.tif
> Rendered page 0 to file /tmp/tmp.Vu25eURnxk/eng/eng.Century_Schoolbook_L_Italic.exp0.tif
> Rendered page 0 to file /tmp/tmp.Vu25eURnxk/eng/eng.Century_Schoolbook_L_Bold_Italic.exp0.tif
> ERROR: /tmp/tmp.Vu25eURnxk/eng/eng.Arial_Bold.exp0.box does not exist or is not readable
>
> The error output is shown above, and I have checked that the .tif files do
> exist.
>


Re: [tesseract-ocr] Training Tesseract 4.0 for Nepali Language

2017-05-09 Thread ShreeDevi Kumar
see https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
for info about training.

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, May 9, 2017 at 12:38 PM, ShreeDevi Kumar <shreesh...@gmail.com>
wrote:

> Please provide sample of 'not giving good results' and samples of lines
> not being recognized correctly. Images and ground truth files will be
> helpful.
>
> ShreeDevi
> 
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Tue, May 9, 2017 at 12:16 PM, Nirajan Pant <niraja...@gmail.com> wrote:
>
>> The trained data provided here
>> <https://github.com/tesseract-ocr/tessdata> is not giving good results
>> with Nepali text image documents. It is unable to recognize some lines
>> correctly. Can anybody help me re-train Tesseract 4.0 for the Nepali
>> language?
>>


Re: [tesseract-ocr] Training Tesseract 4.0 for Nepali Language

2017-05-09 Thread ShreeDevi Kumar
Please provide sample of 'not giving good results' and samples of lines not
being recognized correctly. Images and ground truth files will be helpful.

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, May 9, 2017 at 12:16 PM, Nirajan Pant  wrote:

> The trained data provided here
> <https://github.com/tesseract-ocr/tessdata> is not giving good results
> with Nepali text image documents. It is unable to recognize some lines
> correctly. Can anybody help me re-train Tesseract 4.0 for the Nepali
> language?
>


Re: [tesseract-ocr] Re: got "undefined symbol omp_get_thread_num" while try example "extracting orientation from Tesseract 4.0"

2017-05-07 Thread ShreeDevi Kumar
Most probably the API example has not been updated for tesseract 4.

There have been many changes -
Please see https://abi-laboratory.pro/tracker/timeline/tesseract/

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sun, May 7, 2017 at 5:36 PM, Mike Zang  wrote:

> I updated gcc to 4.8.5 to support C++11 before compiling and installing
> Tesseract. I tried the API example with 4.0 from this page, and it failed:
> https://github.com/tesseract-ocr/tesseract/wiki/APIExample#c-api-in-python
> The OS is CentOS 6. I compiled and installed Tesseract exactly as the
> wiki page describes, and I tried reinstalling the Python cffi library, which
> didn't help. Do I need to recompile Python with the new gcc 4.8, or
> something else?
>
> Somebody help!
>
>
>
> On Saturday, May 6, 2017 at 11:40:56 PM UTC+8, Mike Zang wrote:
>>
>> Got "undefined symbol omp_get_thread_num" while trying the API example
>> "extracting orientation from Tesseract 4.0" with Python and
>> tesseract-master.
>>
>> Help!
>>
>> The tesseract command line works fine.
>> Does that mean there is some problem with my Python (neither 2.6 nor 3.4
>> works) or some other environment issue?

To view this discussion on the web visit
https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduVeeMaFBXmiKxDxHZk-b--W3ZaswUD-ZunOY-k-Zijp_Q%40mail.gmail.com.


Re: [tesseract-ocr] Re: Fine tuning with existing box/tiff pairs in Tesseract 4.0

2017-05-06 Thread ShreeDevi Kumar
When using pre-existing box/tiff pairs, you have to add a box containing a tab
character to mark the end of each line, and also add a box containing a space
after every word.

You then need to generate the .lstmf files. Please see
training/tesstrain.sh for details.
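
As a rough illustration of the box-file change described above, the sketch
below emits a space box between words and a tab box at the end of each line.
It assumes the usual "symbol left bottom right top page" layout of box file
lines. The Box tuple, the add_space_and_tab_boxes helper, and the coordinates
chosen for the inserted boxes are hypothetical, so compare the result with
what training/tesstrain.sh produces before relying on it.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """One box file entry (coordinates as in a Tesseract box file)."""
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int = 0

    def to_line(self) -> str:
        return f"{self.symbol} {self.left} {self.bottom} {self.right} {self.top} {self.page}"


def add_space_and_tab_boxes(lines: List[List[List[Box]]]) -> List[str]:
    """lines -> words -> character boxes; returns box file lines.

    Assumption: a space box fills the gap between two words, and the tab box
    is a one-pixel-wide box just right of the last character of the line.
    """
    out: List[str] = []
    for words in lines:
        words = [w for w in words if w]  # ignore empty words
        for i, word in enumerate(words):
            out.extend(box.to_line() for box in word)
            last = word[-1]
            if i + 1 < len(words):
                nxt = words[i + 1][0]
                sep = Box(" ", last.right, last.bottom, nxt.left, last.top, last.page)
            else:
                sep = Box("\t", last.right, last.bottom, last.right + 1, last.top, last.page)
            out.append(sep.to_line())
    return out
```

The caller has to group the existing character boxes into words and lines
first, since a plain box file does not record that structure.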

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sat, May 6, 2017 at 4:40 PM, bmwmine  wrote:

> you are missing the .lstmf files

To view this discussion on the web visit
https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduWDVvNYAEpT_VukjqSKcb96zY%3DpEA4gSxgUVfzb%2BXKCnQ%40mail.gmail.com.


Re: [tesseract-ocr] Wrong or missing Segmentation of Words

2017-05-04 Thread ShreeDevi Kumar
Please provide your original image for testing. Thanks!

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Thu, May 4, 2017 at 5:36 PM, 'Thomas Zipproth' via tesseract-ocr <
tesseract-ocr@googlegroups.com> wrote:

> We tried to read English documents with the C++ API (Tesseract 3 and 4),
> but in most cases a lot of words are missing or the word rectangles are
> completely wrong.
>
> In the attached example, you can see the missing words and a wrong
> rectangle marked in red.
> I tried different page segmentation modes and other parameters, but
> without success.
>
> The document resolution is 300 dpi; it has been resized slightly (smaller).

To view this discussion on the web visit
https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduUV6Z5FKFVQD9yQ7Y368GgTA04BO-ZvR0WHq8B983MUFg%40mail.gmail.com.


Re: [tesseract-ocr] Re: How to make training for Arabic in Tesseract 4.0

2017-05-04 Thread ShreeDevi Kumar
Ibr,

Your description of LSTM training is incorrect.

What you are doing will use the ara.traineddata provided in the repo, so there
will be no change in output.

Once the .lstmf files are created, you have to run lstmtraining, which may run
for days or weeks before it gives you a good result.

Please read about LSTM training on the wiki.

On May 4, 2017 2:58 PM, "Ibr"  wrote:

> If you are referring to Tesseract 4.00alpha with Leptonica 1.74.1, and you
> compiled them correctly and got the binaries you need for training .lstmf
> files, then I recommend following the suggestion made by the Tesseract
> devs: once you create an .lstmf file for a certain font (one that can be
> used for Arabic writing), get the official ara.traineddata file from
> GitHub, put it in the tessdata folder and the .lstmf file in the tesseract
> folder, and run the command
>   tesseract text_image result_text -l ara --oem 1
> Which Arabic characters exactly are you trying to improve the accuracy for?
>
> On Saturday, April 8, 2017 at 11:52:25 AM UTC+3, Ahmad Moawad wrote:
>
>> Hello All,
>>
>>
>> I want to train Tesseract 4.0 for the Arabic language. The results of
>> this version are great but still need some tuning, so I got
>> jTessBoxEditor 2.0 beta.
>> I tried to correct the wrong characters and build ara.traineddata.
>> After copying ara.traineddata to /usr/share/tesseract-ocr/4.00/tessdata,
>> I got random characters when I ran tesseract on the image.
>> Is there any suggestion on how to train version 4.0? I already know
>> that the cube engine from the last 3.0x version is not included in 4.0
>> LSTM. Or should I wait until Ray makes another updated ara.traineddata?
>>
>> Thanks.

To view this discussion on the web visit
https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduXHFwb4uvN_nT%2BRnepV%3DDbyc7HEpyGNOZL79O_2EbyKUA%40mail.gmail.com.


Re: [tesseract-ocr] Converting Handwritten image to text format

2017-05-02 Thread ShreeDevi Kumar
Tesseract is not meant for OCR of handwriting.

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, May 2, 2017 at 1:02 PM, Jaya Kumar  wrote:

> Hi,
>
> I have an image document and I am trying to convert it into a text file
> using the tesseract command line tool.
>
> I am using the command below to convert the image into a text file.
>
> *tesseract.exe input.png out.txt*
>
> When I open the output file, it is empty: the text in the image has not
> been converted.
>
> What could be the problem? Can anyone please help me?
>
>
> Thanks
> Jay,

To view this discussion on the web visit
https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduXkLkjU5MzkiTi2mT2R5OzVwQjCuqUebYNtKA29E%2BmrnQ%40mail.gmail.com.


Re: [tesseract-ocr] Re: Standalone Self-contained Tesseract-OCR for Mac

2017-05-01 Thread ShreeDevi Kumar
Stefan,
Please make the Mac binaries available for both 3.05 and 4.00, similar to
Windows.
I noticed that you have posted the test version for standalone Tess.
Thanks!

PS: Are the Travis-created binaries available for download by users?

On May 1, 2017 7:30 PM, "'Stefan Weil' via tesseract-ocr" <
tesseract-ocr@googlegroups.com> wrote:
>
> On Thursday, 24 March 2016 11:49:03 UTC+1, Peter Reid wrote:
>>
>> I have a standalone version of tesseract-ocr for Windows that can be run
>> from a folder located anywhere in the Windows filing system without having
>> to do an installation. For the Mac the user has to install
>> Homebrew/MacPorts first and then tesseract-ocr afterwards.
>
> Building Tesseract with Homebrew or MacPorts is much easier than with
> your script, and it simply works. End users who want to run Tesseract don't
> need Homebrew or MacPorts. They only need some libraries which can be
> copied and distributed with the tesseract executable.

To view this discussion on the web visit
https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduVobSxqh-AssQkd7vk%2BV0zhZ0puNS44oQHNSGxcdP9LaQ%40mail.gmail.com.


Re: [tesseract-ocr] not reading the image properly in tesseract OCR

2017-04-27 Thread ShreeDevi Kumar
See https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage

- excuse the brevity, sent from mobile

On 27-Apr-2017 9:04 PM, "ShreeDevi Kumar" <shreesh...@gmail.com> wrote:

> Tesseract output is plain text only; you will not get rich text with fonts,
> etc.
>
> ShreeDevi
> 
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Thu, Apr 27, 2017 at 7:25 PM, Jaya Kumar <jayakumar...@gmail.com>
> wrote:
>
>> Hi
>> I have a hand typewritten image and I am trying to convert it to text using
>> the tesseract command, but it is giving the output as plain text. What could
>> be the problem?
>> tesseract.exe test.png out.txt
>> Could you please help me?
>>
>> Thanks
>> Jay,

To view this discussion on the web visit
https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduV_B%3DJVRw9e_u2RDUSN%2B9%3DhHXhpRghuNjZ_n3bJ3H86Sw%40mail.gmail.com.


Re: [tesseract-ocr] not reading the image properly in tesseract OCR

2017-04-27 Thread ShreeDevi Kumar
Tesseract output is plain text only; you will not get rich text with fonts,
etc.
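
For reference, the command from the quoted message can be wrapped as in the
sketch below. It assumes a tesseract executable on the PATH and an installed
eng language pack; build_ocr_command and run_ocr are hypothetical helper
names. Note that the second argument is an output base name to which
tesseract appends ".txt" itself, so passing "out.txt" as in the quoted
command produces a file called out.txt.txt.

```python
import subprocess
from pathlib import Path


def build_ocr_command(image, output_base, lang="eng"):
    """Build the argument list for: tesseract IMAGE OUTPUTBASE -l LANG."""
    return ["tesseract", str(image), str(output_base), "-l", lang]


def run_ocr(image, output_base, lang="eng"):
    """Run tesseract and return the recognized plain text."""
    subprocess.run(build_ocr_command(image, output_base, lang), check=True)
    # tesseract writes its plain-text result to OUTPUTBASE.txt
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

Calling run_ocr("test.png", "out") would then read the text back from
out.txt; the result carries no font or layout information.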

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Thu, Apr 27, 2017 at 7:25 PM, Jaya Kumar  wrote:

> Hi
> I have a hand typewritten image and I am trying to convert it to text using
> the tesseract command, but it is giving the output as plain text. What could
> be the problem?
> tesseract.exe test.png out.txt
> Could you please help me?
>
> Thanks
> Jay,

To view this discussion on the web visit
https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduUhNVo%2BPBLmxUadp2OHByrFw0kCW%3DHhZQRcXg5SuOYitg%40mail.gmail.com.


Re: [tesseract-ocr] pb install on redhat PKG_CHECK_MODULES(LEPTONICA

2017-04-25 Thread ShreeDevi Kumar
I built both from source yesterday.

Try the following for building tesseract:

./autogen.sh
./configure
LDFLAGS="-L/usr/local/lib" CFLAGS="-I/usr/local/include" make
sudo make install
sudo ldconfig

as given on the Compiling page of the wiki.
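
A "syntax error near unexpected token" at PKG_CHECK_MODULES, as in the quoted
configure output, usually means the configure script was generated without
the pkg-config macros available. Before re-running autogen.sh it may help to
check that pkg-config is installed and can see Leptonica. The sketch below is
only a diagnostic aid: the module name "lept" is taken from the quoted error
message, while leptonica_version and the extra_pc_dirs parameter are
hypothetical names.

```python
import os
import shutil
import subprocess


def leptonica_version(extra_pc_dirs=(), which=shutil.which, run=subprocess.run):
    """Return the Leptonica version reported by pkg-config, or None.

    extra_pc_dirs: directories containing lept.pc for a custom install
    prefix (an assumption; adjust to where Leptonica was installed).
    """
    if which("pkg-config") is None:
        return None  # pkg-config itself is missing
    env = dict(os.environ)
    if extra_pc_dirs:
        parts = list(extra_pc_dirs)
        if env.get("PKG_CONFIG_PATH"):
            parts.append(env["PKG_CONFIG_PATH"])
        env["PKG_CONFIG_PATH"] = os.pathsep.join(parts)
    result = run(["pkg-config", "--modversion", "lept"],
                 env=env, capture_output=True, text=True)
    if result.returncode != 0:
        return None  # pkg-config cannot find the lept module
    return result.stdout.strip()
```

If this returns None, installing pkg-config or pointing PKG_CONFIG_PATH at the
Leptonica install is worth trying before running ./autogen.sh again.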

- excuse the brevity, sent from mobile

On 25-Apr-2017 2:14 PM, "Jean-Christophe Penalva" <
jeanchristophe.pena...@gmail.com> wrote:

>   Hello,
>
>   I am trying to compile the new version of Tesseract (v4.0x) on Red Hat from
> source. I have already compiled and installed libleptonica, and now I have a
> problem during the configure stage of tesseract:
>
> export CFLAGS=-I./leptonica/1.74.1/include
> export CPPFLAGS=-I/leptonica/1.74.1/include
> export LDFLAGS=-L.../leptonica/1.74.1/lib
>
> ./configure --prefix=.../tesseract/4.0 
> --with-extra-includes=.../leptonica/1.74.1/include
> --with-extra-libraries=/leptonica/1.74.1/lib
> ...
> 
> checking for long long int... yes
> checking for off_t... yes
> checking for mbstate_t... yes
> ./configure: line 16347: syntax error near unexpected token `LEPTONICA,'
> ./configure: line 16347: `PKG_CHECK_MODULES(LEPTONICA, lept >= 1.74,
> have_lept=true, have_lept=false)'
>
> I searched for this message but found nothing. Is it possible to install
> tesseract from source (with Leptonica also built from source)?

To view this discussion on the web visit
https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduWfMaLBAv37b0xAxZs-93d%3D-m74OTrPZPV%3Dpi7%2BED5pxg%40mail.gmail.com.


Re: [tesseract-ocr] Absolute beginner requesting help for getting started with Tesseract in C++ application.

2017-04-25 Thread ShreeDevi Kumar
See

https://github.com/tesseract-ocr/tesseract/wiki/User-App-Example

https://github.com/tesseract-ocr/tesseract/wiki/APIExample

- excuse the brevity, sent from mobile

On 25-Apr-2017 12:11 PM, "Dhairya Shah"  wrote:

> Dear All,
> I am a complete beginner with Tesseract, and with using external
> libraries for that matter. I want to write a C++ application that uses the
> Tesseract API (baseapi.h and allheaders.h). However, I could not find any
> tutorial on the internet that tells me how to add these libraries to my
> IDE (Code::Blocks).
>
> I am using 64-bit Windows.
>
> The questions I want to ask are:
>  1) In order to include baseapi.h, is installing Tesseract using the
> Windows installer enough, or do I have to do something else? (I figure I
> have to download tesseract-ocr from git.)
>  2) I have successfully downloaded and installed Leptonica. Please
> guide me on how to link the libraries in Code::Blocks. (I have tried messing
> around with Settings -> Linker settings, but have had no success.)
>
> I just want to write a small application which uses the Tesseract API. Please
> give me any input on how to do this. Using Code::Blocks or C++ is not
> necessary; any IDE or any language would do. If you could tell me how to
> compile my C++ application from cmd, that would also be more than helpful.
>
> Once again, I am a complete beginner. Kindly help me out, and I'm sorry if I
> sound stupid.
> Thanks.

To view this discussion on the web visit
https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduVaOBq7GF9%2B8VD4ocMVTVef15F4F2yPCHdgp8_AXxj0zQ%40mail.gmail.com.


Re: [tesseract-ocr] Re: issue with simple reading of numbers 9 and 8

2017-04-23 Thread ShreeDevi Kumar
James,

Were you able to get this to work for you with 3.04/3.05?

I get accurate results using Tesseract 4.0 alpha, though it takes longer
with --oem 1 than --oem 0.


./troublewith98-300.jpg
Tesseract Open Source OCR Engine v4.00.00alpha-385-gab41465 with Leptonica

real    0m1.203s
user    0m0.578s
sys     0m0.203s
Tesseract Open Source OCR Engine v4.00.00alpha-385-gab41465 with Leptonica

real    0m4.485s
user    0m5.125s
sys     0m0.234s
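
A timing comparison like the one above can be reproduced roughly as follows.
This is a sketch, assuming a tesseract executable on the PATH and a
traineddata file that supports both engine modes. The --oem 0 and --oem 1
values come from this message; oem_command and time_command are hypothetical
helper names.

```python
import subprocess
import time


def oem_command(image, output_base, oem):
    """Argument list for running tesseract with a given OCR engine mode."""
    return ["tesseract", image, output_base, "--oem", str(oem)]


def time_command(cmd):
    """Run a command to completion and return the wall-clock seconds it took."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.perf_counter() - start


# Example use (not executed here):
#   for oem in (0, 1):
#       seconds = time_command(oem_command("troublewith98-300.jpg", "out", oem))
#       print(f"--oem {oem}: {seconds:.3f}s")
```

This measures only elapsed wall-clock time, so it corresponds to the "real"
figure above rather than to the user or sys figures.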

See attached ..

You can test with
https://sourceforge.net/projects/vietocr/files/vietocr.net/5.0alpha/
which uses Tesseract.NET (Tesseract 4.00alpha 362b68e)


ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sun, Apr 23, 2017 at 9:25 AM, ShreeDevi Kumar <shreesh...@gmail.com>
wrote:

> Try training using more samples of 8, 9, B etc.
>
> What results do you get with the provided eng.traineddata?  Are they
> better or worse?
>
> Have you tried changing DPI of image to 300?
>
> - excuse the brevity, sent from mobile
>
> On 22-Apr-2017 10:29 PM, "James Abney" <abne...@gmail.com> wrote:
>
>> Oh yes, I guess I forgot to include that information. I did train using
>> only that font and with the same font size. I am on Windows 7 and I used
>> 3.05 to train, although the .NET wrapper I use is 3.04. I don't see why it
>> has difficulty with the 9 and 8; it seems very odd.
>>
>> On Friday, April 21, 2017 at 11:05:49 PM UTC-4, shree wrote:
>>>
>>> Which version of Tesseract? Which OS?
>>>
>>> If all your text is in tungsten-semibold, have you tried training with
>>> just that font?
>>>
>>> - excuse the brevity, sent from mobile
>>>
>>>
>>> On 22-Apr-2017 12:50 AM, "James Abney" <abn...@gmail.com> wrote:
>>>
>>> The font is tungsten semibold
>>>
>>>
>>> On Friday, April 21, 2017 at 2:08:53 PM UTC-4, James Abney wrote:
>>>>
>>>> I'm having issues with Tesseract dealing with the numbers 9 and 8,
>>>> especially when they are next to each other. This is really the only issue
>>>> I have. Even when I OCR a TIFF file, it shows 123456789 as 123456788. I will
>>>> link an example. Any help is appreciated. The following image is an example
>>>> where my software using Tesseract interprets 899B8993B as 8-838.
>>>>
>>>>
>>>> <https://lh3.googleusercontent.com/-HF3RzbqMD6I/WPo8RYC6GaI/AJg/phkq6dgtvSE5f3upJQrfowEp1vyW8TQXwCLcB/s1600/troublewith98.png>

To view this discussion on the web visit
https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduVOcgryCqD77SZgHKDuJqgGCQmW9U9zFdgOoG8HT%2BHK3Q%40mail.gmail.com.

Attached output:

B-s38
899B8993B

B-838
899889938

899B8993B
899889938



Re: [tesseract-ocr] Re: issue with simple reading of numbers 9 and 8

2017-04-22 Thread ShreeDevi Kumar
Try training using more samples of 8, 9, B etc.

What results do you get with the provided eng.traineddata?  Are they better
or worse?

Have you tried changing DPI of image to 300?
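
The DPI suggestion amounts to resampling the image so that it has roughly 300
pixels per inch. A small sketch of the arithmetic, assuming the current DPI of
the image is known; scale_to_dpi is a hypothetical helper, and the actual
resampling would be done with whatever image tool you prefer.

```python
def scale_to_dpi(width, height, current_dpi, target_dpi=300):
    """Return the pixel size an image needs to reach target_dpi.

    The physical size (inches) is kept fixed, so both dimensions are
    multiplied by target_dpi / current_dpi.
    """
    if current_dpi <= 0:
        raise ValueError("current_dpi must be positive")
    factor = target_dpi / current_dpi
    return round(width * factor), round(height * factor)
```

For example, a 960 x 480 pixel screenshot at 96 DPI would be resampled to
3000 x 1500 pixels to reach 300 DPI.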

- excuse the brevity, sent from mobile

On 22-Apr-2017 10:29 PM, "James Abney"  wrote:

> Oh yes, I guess I forgot to include that information. I did train using
> only that font and with the same font size. I am on Windows 7 and I used
> 3.05 to train, although the .NET wrapper I use is 3.04. I don't see why it
> has difficulty with the 9 and 8; it seems very odd.
>
> On Friday, April 21, 2017 at 11:05:49 PM UTC-4, shree wrote:
>>
>> Which version of Tesseract? Which OS?
>>
>> If all your text is in tungsten-semibold, have you tried training with
>> just that font?
>>
>> - excuse the brevity, sent from mobile
>>
>>
>> On 22-Apr-2017 12:50 AM, "James Abney"  wrote:
>>
>> The font is tungsten semibold
>>
>>
>> On Friday, April 21, 2017 at 2:08:53 PM UTC-4, James Abney wrote:
>>>
>>> I'm having issues with Tesseract dealing with the numbers 9 and 8,
>>> especially when they are next to each other. This is really the only issue
>>> I have. Even when I OCR a TIFF file, it shows 123456789 as 123456788. I will
>>> link an example. Any help is appreciated. The following image is an example
>>> where my software using Tesseract interprets 899B8993B as 8-838.
>>>
>>>
>>> 

To view this discussion on the web visit
https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduVC%3DTw0CjKNF7aNE%3DkQN-T_-U879u9NsMRZivFKmXL5jA%40mail.gmail.com.


Re: [tesseract-ocr] Re: issue with simple reading of numbers 9 and 8

2017-04-21 Thread ShreeDevi Kumar
Which version of Tesseract? Which OS?

If all your text is in tungsten-semibold, have you tried training with just
that font?

- excuse the brevity, sent from mobile


On 22-Apr-2017 12:50 AM, "James Abney"  wrote:

The font is tungsten semibold


On Friday, April 21, 2017 at 2:08:53 PM UTC-4, James Abney wrote:
>
> I'm having issues with Tesseract dealing with the numbers 9 and 8,
> especially when they are next to each other. This is really the only issue
> I have. Even when I OCR a TIFF file, it shows 123456789 as 123456788. I will
> link an example. Any help is appreciated. The following image is an example
> where my software using Tesseract interprets 899B8993B as 8-838.
>
>
> 

To view this discussion on the web visit
https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduWUnv6MSrujgzZwq9kkWpJXom4Rc1sHzBbvKF-s8-ZC%3Dw%40mail.gmail.com.


Re: [tesseract-ocr] Training tesseract-ocr unicharset_extractor, mftraining, cntraining

2017-04-21 Thread ShreeDevi Kumar
If you want to OCR an invoice like the sample you posted, just use the
eng.traineddata and OCR the page. You do not need to do any training.

Here is the output I get



8633 0410 NO RP 11 07122015 NYNN 01 01 0001 Page 2 Of 3


Did you know?


Your Comcast Business Internet

service gives you access to millions

of WiFi hotspots with the fastest WiFi

and even more coverage. Find out

more at businesscomcast.com/wifi.



Need help? We’re here for you.


9 Visit business.comcast.com/help

Call 1-800—391 -3000

A


Billing support

Open 6 am-9 pm MTN, Mon through Fri

and 7 am—8 pm Sat


Technical support

Open 24 hours, 7 days a week



Did you know?


Never miss a payment with text alerts.

Receive text message reminders when your

bill is ready to pay or past due. Sign up at

business.comcast.com/myaccount.



Your bill is ready




Please notify us immediately with any

questions regarding charges billed to your

account. Comcast will issue a credit or

refund for any verified billing error which is

brought to our attention within sixty (60) days

of the bill.


ll


Additional payment options Moving? Let us help.


Automatic payment

Sign up at business.comcast.com/myaccount


a Oniine


Visit business.comcast.com/myaccount


a By phone

Call 1-800-391 -3000


if you're moving, give us as much

advanced notice as possible so we

can help make a smooth transition.


Call 1 -800-391 -3000


|||ll




ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Fri, Apr 21, 2017 at 11:34 AM, Alain Ghawi  wrote:

> Hello all,
>
> I am surprised that so many people tell me Tesseract is the best
> open-source OCR tool, and yet there is no video explaining step by step the
> problems you can encounter, nor a good explanation and documentation
> for OCR.
>
> Even so, everyone loves a challenge! So here is the challenge I
> faced. I have many PDF files that are invoices, and I want to train
> Tesseract to be able to OCR them as scanned images.
> So first of all, I converted these PDF files into TIFF files
> using: magick -density 300 -depth 4   2151.pdf -background white -fill
> white -alpha Off  2151%d.tif
> This is ImageMagick. Nothing important here other than that we get a 300 dpi
> image with the alpha channel off.
>
> You must then rename the .tif files to
> [lang].[name_font].exp0.tif (com.test_font.exp0.tif in my example).
>
> Great! After this step you must create your box file right? So I simply
> called:
> tesseract com.test_font.exp0.tif com.test_font.exp0 batch.nochop makebox
> tesseract com.test_font.exp1.tif com.test_font.exp1 batch.nochop makebox
>
> Then I fixed my box files with CowBoxEditor, which did the job, as I couldn't
> find the famous jTessBoxEditor online (weird, right?).
>
> After that, I created my .tr files:
> tesseract com.test_font.exp0.tif com.test_font.exp0 nobatch box.train
> tesseract com.test_font.exp1.tif com.test_font.exp1 nobatch box.train
>
> And here come the surprises!
> Once you have your .tr files, you call unicharset_extractor.
> First question: Why are the glyph metrics all 0,255,0,255,0,0,0,0,0,0?
> That is wrong according to the documentation:
> https://github.com/tesseract-ocr/tesseract/blob/a3ba11b030345d32829b1e8355afea5419978d82/doc/unicharset.5.asc
> Second question: Should I pass one box file at a time, or combine
> them? Option 1: unicharset_extractor com.test_font.exp0.box   or Option 2:
> unicharset_extractor com.test_font.exp0.box com.test_font.exp1.box
> Third question: Why should I use set_unicharset_properties? It doesn't
> fix the metrics; it only specifies whether the script is Latin or Common. Link:
> https://github.com/tesseract-ocr/tesseract/issues/318
>
> After all these unanswered questions, I ran mftraining and cntraining (no
> problems). Finally, I renamed my inttemp, normproto, pffmtable and shapetable
> files and combined them using combine_tessdata com.
>
> Final question: If I name the files com.inttemp1 and com.inttemp2, does it
> work? Same for shapetable, normproto and pffmtable.
>
> I think these questions are asked more than once by all new users of
> Tesseract. If any Tesseract expert can answer them, it
> will be a great help for the whole community.
> Kindly find the attached 2 tif files and the boxes generated.
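
On the second question in the message above: as far as I know, the 3.0x training tools accept several files in one call, so all box files usually go to unicharset_extractor together, and all .tr files go to mftraining and cntraining together. The following is only a dry-run sketch that prints the command sequence; legacy_training_plan is a hypothetical helper name, and the language, font and exp numbers are taken from the example above. It assumes a font_properties file already exists.

```shell
#!/bin/sh
# Dry run: print the legacy (3.0x) training commands for one language and
# font across several exp pages. Nothing is executed; the commands are only
# echoed so they can be reviewed first.
# Usage: legacy_training_plan LANG FONT EXP_NUMBER...
legacy_training_plan() {
  lang=$1; font=$2; shift 2
  boxes=""; trs=""
  for n in "$@"; do
    base="$lang.$font.exp$n"
    # One .tr file is produced per image/box pair.
    echo "tesseract $base.tif $base nobatch box.train"
    boxes="$boxes $base.box"
    trs="$trs $base.tr"
  done
  # All box files are passed together, so one unicharset covers every page.
  echo "unicharset_extractor$boxes"
  echo "mftraining -F font_properties -U unicharset -O $lang.unicharset$trs"
  echo "cntraining$trs"
  for f in inttemp normproto pffmtable shapetable; do
    echo "mv $f $lang.$f"
  done
  echo "combine_tessdata $lang."
}

legacy_training_plan com test_font 0 1
```

Printing the plan first makes it easy to check the file names before anything is overwritten.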
>

Re: [tesseract-ocr] Re: Tesseract Installation

2017-04-19 Thread ShreeDevi Kumar
You can check that these are installed by entering the following:

which text2image

The above will show you the location where it is installed.

If you don't have the training tools, you will need to build them separately -
see https://github.com/tesseract-ocr/tesseract/wiki/Compiling

make training
sudo make training-install



Re: [tesseract-ocr] Re: Standalone Self-contained Tesseract-OCR for Mac

2017-04-18 Thread ShreeDevi Kumar
I haven't built 3.05, so I cannot help. I would suggest trying older
commits of the tesseract 3.05 branch to see which one works.

I hope that those who have built 3.05 on Mac will help.



Re: [tesseract-ocr] Re: Standalone Self-contained Tesseract-OCR for Mac

2017-04-18 Thread ShreeDevi Kumar
Please see https://github.com/tesseract-ocr/tesseract/wiki/Compiling


If you are building tesseract 4.0, you need Leptonica 1.74.

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, Apr 18, 2017 at 2:25 PM, Peter Reid  wrote:

> Hi ShreeDevi
>
> I have tried the latest version of Leptonica but I get numerous warnings
> (38 of them, mainly about implicit function definitions) and a fatal error
> 'endian.h' not found.  The build finishes saying that Leptonica has been
> built OK and its library appears in the lib folder.  However, when I try to
> build Tesseract, I get the following error:
>
> checking for leptonica... yes
> checking for pixCreate in -llept... no
> configure: error: leptonica library missing
> Configuration done, now Building
> make: Nothing to be done for `install'.
> Tesseract build failed. Exiting.
>
> So I'm no better off with the latest version.  At least with version 1.73
> I don't get the warnings and error messages when building Leptonica, even
> though the Tesseract build fails.
>
> Thanks
>
> Peter
>
>
> On Thursday, March 24, 2016 at 10:49:03 AM UTC, Peter Reid wrote:
>>
>> I have a standalone version of tesseract-ocr for Windows that can be run
>> from a folder located anywhere in the Windows filing system without having
>> to do an installation.  For the Mac the user has to install
>> HomeBrew/MacPort first and then tesseract-ocr afterwards.  This fixes
>> tesseract-ocr to particular parts of the OS X filing system, preventing it
>> from being relocated and used elsewhere on the Mac.
>>
>> I'm looking for a standalone/self-contained version of tesseract-ocr for
>> the Mac that can be located anywhere and can be run without requiring
>> installations.  Please can someone point to such a version of tesseract-ocr
>> or give instructions on how I can build one of these!
>>
>> Thanks
>>


Re: [tesseract-ocr] Re: Standalone Self-contained Tesseract-OCR for Mac

2017-04-18 Thread ShreeDevi Kumar
Use the latest version of Leptonica, 1.74.1:

https://github.com/DanBloomberg/leptonica

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Mon, Apr 17, 2017 at 8:18 PM, Peter Reid  wrote:

> I've done some further searching and found several versions of shell
> scripts that are supposed to generate a standalone version of Tesseract.
> However, they all fail at the last part of the process, namely building
> Tesseract itself!  The script builds the libraries for zlib (v1.2.8),
> libpng (v1.6.13), libjpeg (9b) and leptonica (v1.73), but fails with the
> following error:
>
>   checking for leptonica... yes
>   checking for pixCreate in -llept... no
>   configure: error: leptonica library missing
>
> I can't find a way to correct this!  Here's the config details that lead
> to this error:
>
> export CXXFLAGS="-I$BUILD_DIR/include -I$BUILD_DIR/include/libpng16
> -I$BUILD_DIR/include/leptonica -lpng -ljpeg -lz"
> export CPPFLAGS="-I$BUILD_DIR/include -I$BUILD_DIR/include/libpng16
> -I$BUILD_DIR/include/leptonica -lpng -ljpeg -lz"
> export LDFLAGS="-L$BUILD_DIR/lib"
> export LIBLEPT_HEADERSDIR="$BUILD_DIR/include/leptonica"
>
> ./configure --prefix=$TESSERACT_DIR --with-extra-libraries=$BUILD_DIR/lib
>
> [Note: I added the CXXFLAGS as well as the CPPFLAGS as I wasn't sure which
> was needed]
>
> I have attached the latest version of the shell script I'm using so you
> can see the context.
>
> Can anyone fix my script or tell me another way of generating a standalone
> version of Tesseract for the Mac?
>
> Thanks
>
>
> On Thursday, March 24, 2016 at 10:49:03 AM UTC, Peter Reid wrote:
>>
>> I have a standalone version of tesseract-ocr for Windows that can be run
>> from a folder located anywhere in the Windows filing system without having
>> to do an installation.  For the Mac the user has to install
>> HomeBrew/MacPort first and then tesseract-ocr afterwards.  This fixes
>> tesseract-ocr to particular parts of the OS X filing system, preventing it
>> from being relocated and used elsewhere on the Mac.
>>
>> I'm looking for a standalone/self-contained version of tesseract-ocr for
>> the Mac that can be located anywhere and can be run without requiring
>> installations.  Please can someone point to such a version of tesseract-ocr
>> or give instructions on how I can build one of these!
>>
>> Thanks
>>


Re: [tesseract-ocr] Re: segmentation fault with tesseract 4

2017-04-12 Thread ShreeDevi Kumar
Please open an issue, as the problem is related to --psm 0.

- excuse the brevity, sent from mobile

On 13-Apr-2017 9:29 AM, "Pritam Dodeja"  wrote:

> Find below - I can also ship my docker container to you if you want so you
> can see my exact setup, it's about 1.15GB
>
> Pritam
>
> On Wednesday, April 12, 2017 at 10:09:35 PM UTC-4, shree wrote:
>>
>> Which operating system - Ubuntu 16.10 Yakkety Yak on x86_64
>> Which version/commit of tesseract - top of Changelog says 2017-03-24 -
>> v4.00.00-alpha
>> How was tesseract built, or where did you get the binaries - I compiled
>> it from source
>>
>> Does it work with other psm values - yes, works with 3
>> Do you have the correct version of traineddata - tesseract --list-langs
>> works as expected, I got eng.traineddata from github, md5sum for that one
>> starts with 7af2
>>
>
>
>
>>
>> - excuse the brevity, sent from mobile
>>
>> On 12-Apr-2017 11:22 PM, "Pritam Dodeja"  wrote:
>>
>>> The command below also produces the same result ( segmentation fault )
>>>
>>> tesseract a.jpg stdout --oem 1 --psm 0 -l eng
>>>
>>> Pritam
>>>
>>> On Wednesday, April 12, 2017 at 10:56:09 AM UTC-4, shree wrote:

>>>> See https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage
>>>>
>>>> Follow the correct order of arguments:
>>>>
>>>>   tesseract  imagename|stdin outputbase|stdout [options...] [configfile...]
>>>>
>>>>
>>>> ShreeDevi
>>>>
>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>
>>>> On Wed, Apr 12, 2017 at 8:01 PM, Pritam Dodeja wrote:

> The command was the following:
>
> tesseract -l eng --oem 1 --psm 0 a.jpg stdout
>
> As far as where it occurred exactly, I can't tell.  I have been able
> to reproduce this with multiple jpgs - let me know if you need any further
> info
>
> tesseract --version shows
>
> tesseract 4.00.00alpha
> leptonica-1.74.1
> libjpeg 8d (libjpeg-turbo 1.5.0) : libpng 1.2.54 : libtiff 4.0.6 :
> zlib 1.2.8
>
> Pritam
>
> On Wednesday, April 12, 2017 at 6:00:12 AM UTC-4, srn...@gmail.com
> wrote:
>>
>> Can you tell me when you got this? That is, which command
>> produced this error, and at which step?
>>
>> On Wednesday, April 12, 2017 at 12:16:54 PM UTC+5:30, Pritam Dodeja
>> wrote:
>>>
>>> Hi,
>>>
>>> I get segmentation faults when using page segmentation mode 0.  Has
>>> anyone else experienced this?
>>>
>>> Pritam
>>>

Re: [tesseract-ocr] Re: segmentation fault with tesseract 4

2017-04-12 Thread ShreeDevi Kumar
See https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage

Follow the correct order of arguments:

  tesseract  imagename|stdin outputbase|stdout [options...] [configfile...]
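
As an illustration of that ordering only, the quoted command can be rearranged so that the image name and output base come first and the options follow. tess_cmd is a hypothetical helper that just prints the command line; it does not run tesseract, and it says nothing about whether the crash itself goes away.

```shell
#!/bin/sh
# Print a tesseract command line in the documented order:
#   tesseract imagename outputbase [options...]
# Usage: tess_cmd IMAGE OUTPUTBASE [OPTIONS...]
tess_cmd() {
  image=$1; outbase=$2; shift 2
  echo "tesseract $image $outbase $*"
}

# The command from the report, with the options moved to the end.
tess_cmd a.jpg stdout -l eng --oem 1 --psm 0
```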


ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Wed, Apr 12, 2017 at 8:01 PM, Pritam Dodeja 
wrote:

> The command was the following:
>
> tesseract -l eng --oem 1 --psm 0 a.jpg stdout
>
> As far as where it occurred exactly, I can't tell.  I have been able to
> reproduce this with multiple jpgs - let me know if you need any further info
>
> tesseract --version shows
>
> tesseract 4.00.00alpha
> leptonica-1.74.1
> libjpeg 8d (libjpeg-turbo 1.5.0) : libpng 1.2.54 : libtiff 4.0.6 : zlib
> 1.2.8
>
> Pritam
>
> On Wednesday, April 12, 2017 at 6:00:12 AM UTC-4, srn...@gmail.com wrote:
>>
>> Can you tell me when you got this? That is, which command
>> produced this error, and at which step?
>>
>> On Wednesday, April 12, 2017 at 12:16:54 PM UTC+5:30, Pritam Dodeja wrote:
>>>
>>> Hi,
>>>
>>> I get segmentation faults when using page segmentation mode 0.  Has
>>> anyone else experienced this?
>>>
>>> Pritam
>>>


Re: [tesseract-ocr] Train Tesseract 4.0 LSTM based on images

2017-04-12 Thread ShreeDevi Kumar
LSTM training is not like legacy training. Please read the wiki pages
regarding 4.0 training. I have given all the sample commands there. There are 3
different ways of training.

Read the bash scripts regarding training to know more.

tesstrain.sh with --linedata_only creates the box/tiff pairs, but only the
lstmf files are saved in the output dir.

Without --linedata_only you will get 3.0 traineddata.

There are multiple steps to be done using the lstmf files to create the
final 4.0 traineddata.

Since you want to write a tutorial, please do your own reading and trials
first
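
For orientation, a tesstrain.sh call for LSTM line data might look like the sketch below. As far as I can tell the flag is spelled --linedata_only (with an underscore) in the script. lstm_linedata_cmd is a hypothetical helper that only prints the command, and every path in it is a placeholder that has to be adjusted to the local checkout.

```shell
#!/bin/sh
# Print a tesstrain.sh command line that generates LSTM line data (.lstmf).
# The langdata and tessdata paths are placeholders, not required locations.
# Usage: lstm_linedata_cmd LANG OUTPUT_DIR
lstm_linedata_cmd() {
  lang=$1; out_dir=$2
  echo "training/tesstrain.sh --lang $lang --linedata_only" \
       "--langdata_dir ../langdata --tessdata_dir ./tessdata" \
       "--output_dir $out_dir"
}

lstm_linedata_cmd eng /tmp/engtrain
```

The remaining steps that turn the lstmf files into 4.0 traineddata are described on the wiki and are not covered by this sketch.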


- excuse the brevity, sent from mobile

On 12-Apr-2017 4:08 PM,  wrote:

> Sorry, I gave the wrong commands (for Arabic). I was actually referring to
> English.
>
> tesseract eng.arial.exp4.tif eng.arial.exp4 nobatch box.train
> unicharset_extractor eng.arial.exp4.box
> echo "arial 0 0 1 0 0" > font_properties # tell Tesseract information about the font
> mftraining -F font_properties -U unicharset -O eng.unicharset eng.arial.exp4.tr
> shapeclustering -F unicharset eng.arial.exp4.tr
> cntraining eng.arial.exp4.tr
>
> mv inttemp eng.inttemp
> mv normproto eng.normproto
> mv pffmtable eng.pffmtable
> mv shapetable eng.shapetable
> combine_tessdata eng.
>
>
> Please suggest what changes the commands above need for Tesseract 4.0;
> they are for Tesseract 3.0. I plan to publish a rigorous, explanatory
> tutorial after getting an overview of all the steps.
> Thank you.
>
>
>
>
>
>
> On Wednesday, April 12, 2017 at 4:04:42 PM UTC+5:30, shree wrote:
>>
>> Arabic was never trained with the legacy tesseract engine and I doubt you
>> will get any improvement over existing traineddata using cube or lstm.
>>
>> You are free to experiment and see what you come up with.
>>
>> I have pointed to the bash scripts for training. Please refer to them for
>> the correct process.
>>
>> - excuse the brevity, sent from mobile
>>
>> On 12-Apr-2017 4:00 PM,  wrote:
>>
>>> Hello shree, thank you for your valuable reply. Are there any changes I
>>> need to make to the steps below? Please suggest changes
>>> to these commands, which are for Tesseract 3.0:
>>>
>>> tesseract ara.arial.exp4.tif ara.arial.exp4 nobatch box.train
>>> unicharset_extractor ara.arial.exp4.box
>>> echo "arial 0 0 1 0 0" > font_properties # tell Tesseract information about the font
>>> mftraining -F font_properties -U unicharset -O ara.unicharset ara.arial.exp4.tr
>>> shapeclustering -F unicharset ara.arial.exp4.tr
>>> cntraining ara.arial.exp4.tr
>>>
>>> mv inttemp ara.inttemp
>>> mv normproto ara.normproto
>>> mv pffmtable ara.pffmtable
>>> mv shapetable ara.shapetable
>>> combine_tessdata ara.
>>>
>>>
>>> Please suggest changes for the above steps. I plan to publish a rigorous,
>>> explanatory tutorial after getting an overview of all the steps.
>>> Thank you.
>>>
>>>
>>> On Wednesday, April 12, 2017 at 3:38:11 PM UTC+5:30, shree wrote:

>>>> see https://github.com/tesseract-ocr/tesseract/blob/master/training/tesstrain.sh
>>>>
>>>>
>>>> if ((LINEDATA)); then
>>>>   phase_E_extract_features "lstm.train" 8 "lstmf"
>>>>   make__lstmdata
>>>> else
>>>>   phase_E_extract_features "box.train" 8 "tr"
>>>>   phase_C_cluster_prototypes "${TRAINING_DIR}/${LANG_CODE}.normproto"
>>>>   if [[ "${ENABLE_SHAPE_CLUSTERING}" == "y" ]]; then
>>>>     phase_S_cluster_shapes
>>>>   fi
>>>>   phase_M_cluster_microfeatures
>>>>   phase_B_generate_ambiguities
>>>>   make__traineddata
>>>> fi
>>>>
>>>> lstm.train is for LSTM training
>>>>
>>>> box.train is for 3.0 Tesseract legacy engine training
>>>>
>>>> Please note that current master code is for alpha testing for 4.0 LSTM
>>>> and will most probably drop support for legacy engine.
>>>>
>>>> If you want the legacy tesseract engine and train for it, please use
>>>> the 3.05 branch of the github repo.


Re: [tesseract-ocr] Train Tesseract 4.0 LSTM based on images

2017-04-12 Thread ShreeDevi Kumar
Arabic was never trained with the legacy tesseract engine and I doubt you
will get any improvement over existing traineddata using cube or lstm.

You are free to experiment and see what you come up with.

I have pointed to the bash scripts for training. Please refer to them for
the correct process.

- excuse the brevity, sent from mobile

On 12-Apr-2017 4:00 PM,  wrote:

> Hello shree, thank you for your valuable reply. Are there any changes I
> need to make to the steps below? Please suggest changes
> to these commands, which are for Tesseract 3.0:
>
> tesseract ara.arial.exp4.tif ara.arial.exp4 nobatch box.train
> unicharset_extractor ara.arial.exp4.box
> echo "arial 0 0 1 0 0" > font_properties # tell Tesseract information about the font
> mftraining -F font_properties -U unicharset -O ara.unicharset ara.arial.exp4.tr
> shapeclustering -F unicharset ara.arial.exp4.tr
> cntraining ara.arial.exp4.tr
>
> mv inttemp ara.inttemp
> mv normproto ara.normproto
> mv pffmtable ara.pffmtable
> mv shapetable ara.shapetable
> combine_tessdata ara.
>
>
> Please suggest changes for the above steps. I plan to publish a rigorous,
> explanatory tutorial after getting an overview of all the steps.
> Thank you.
>
>
> On Wednesday, April 12, 2017 at 3:38:11 PM UTC+5:30, shree wrote:
>>
>> see https://github.com/tesseract-ocr/tesseract/blob/master/training/tesstrain.sh
>>
>>
>> if ((LINEDATA)); then
>>   phase_E_extract_features "lstm.train" 8 "lstmf"
>>   make__lstmdata
>> else
>>   phase_E_extract_features "box.train" 8 "tr"
>>   phase_C_cluster_prototypes "${TRAINING_DIR}/${LANG_CODE}.normproto"
>>   if [[ "${ENABLE_SHAPE_CLUSTERING}" == "y" ]]; then
>>     phase_S_cluster_shapes
>>   fi
>>   phase_M_cluster_microfeatures
>>   phase_B_generate_ambiguities
>>   make__traineddata
>> fi
>>
>> 
>>
>> lstm.train is for LSTM training
>>
>> box.train is for 3.0 Tesseract legacy engine training
>>
>> Please note that current master code is for alpha testing for 4.0 LSTM
>> and will most probably drop support for legacy engine.
>>
>> If you want the legacy tesseract engine and train for it, please use the
>> 3.05 branch of the github repo.
>>


Re: [tesseract-ocr] Re: Tesseract (4 alpha ) Amibiguos Situation while Correcting Chars in box file

2017-04-12 Thread ShreeDevi Kumar
You can use jTessBoxEditor to edit the box files. Make sure to mark EOL if
you are trying to train using scanned images.

Also note that this part of the code, training 4.0 using
pre-existing images and box files, is untested.

Ray has only explained the method for using images created by text2image.
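
For anyone editing box files by hand instead: merging two boxes amounts to taking the smallest left and bottom values and the largest right and top values, and writing the correct symbol in front. The sketch below assumes the usual box line layout "symbol left bottom right top page"; merge_boxes is a hypothetical helper, and the coordinates are made-up sample values.

```shell
#!/bin/sh
# Merge two box-file lines into one box that covers both.
# Box line format assumed: symbol left bottom right top page
# Usage: merge_boxes SYMBOL "LINE1" "LINE2"
# Note: SYMBOL is passed through awk -v, so a backslash would need escaping.
merge_boxes() {
  printf '%s\n%s\n' "$2" "$3" | awk -v sym="$1" '
    NR == 1 { l = $2 + 0; b = $3 + 0; r = $4 + 0; t = $5 + 0; p = $6 + 0; next }
    {
      if ($2 + 0 < l) l = $2 + 0   # leftmost edge
      if ($3 + 0 < b) b = $3 + 0   # lowest edge
      if ($4 + 0 > r) r = $4 + 0   # rightmost edge
      if ($5 + 0 > t) t = $5 + 0   # highest edge
    }
    END { print sym, l, b, r, t, p }'
}

# A letter "m" that was wrongly boxed as "r" followed by "n".
merge_boxes m "r 100 200 110 220 0" "n 111 200 125 221 0"
```

Splitting is the reverse: one line becomes two, with the right edge of the first box and the left edge of the second placed at the gap between the glyphs.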

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Wed, Apr 12, 2017 at 3:23 PM,  wrote:

> Can you please tell me how to split a box and how to merge two boxes?
> I am not able to find any options for this. If you
> explain, it will be helpful to me and to others.
>
> Thank You.
>
> On Tuesday, April 11, 2017 at 9:10:14 AM UTC+5:30, Quan Nguyen wrote:
>>
>> For Case 1, you'll need to merge the two boxes. For Case 2, you'll
>> correct by splitting the box.
>>
>> On Wednesday, April 5, 2017 at 12:55:37 AM UTC-5, srn...@gmail.com wrote:
>>>
>>> I am trying to correct box files so that I can train Tesseract.
>>>
>>> But I have run into a strange problem:
>>>
>>>
>>> 1) Tesseract recognizes some single letters as two letters. How should I
>>> edit the box file in that case? (screenshot 1)
>>> 2) Tesseract does not recognize some letters at all. How should I edit
>>> the box file in that case? (screenshot 2)
>>>


Re: [tesseract-ocr] Train Tesseract 4.0 LSTM based on images

2017-04-12 Thread ShreeDevi Kumar
see
https://github.com/tesseract-ocr/tesseract/blob/master/training/tesstrain.sh


if ((LINEDATA)); then
  phase_E_extract_features "lstm.train" 8 "lstmf"
  make__lstmdata
else
  phase_E_extract_features "box.train" 8 "tr"
  phase_C_cluster_prototypes "${TRAINING_DIR}/${LANG_CODE}.normproto"
  if [[ "${ENABLE_SHAPE_CLUSTERING}" == "y" ]]; then
    phase_S_cluster_shapes
  fi
  phase_M_cluster_microfeatures
  phase_B_generate_ambiguities
  make__traineddata
fi



lstm.train is the config for LSTM training.

box.train is the config for training the legacy Tesseract 3.0x engine.

Please note that the current master code is for alpha testing of 4.0 LSTM and
will most probably drop support for the legacy engine.

If you want to train the legacy Tesseract engine, please use the
3.05 branch of the GitHub repo.
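
As a rough illustration of the branch in the script excerpt, the sketch below picks the config name from a LINEDATA-style flag and prints the tesseract command that would produce the feature file. The function name feature_cmd and its arguments are made up for this example; it only prints the command and is not how tesstrain.sh itself is written.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of tesstrain.sh): print the tesseract
# command that would extract training features for one image.
#   $1 = 1 for LSTM line data, 0 for the legacy engine
#   $2 = image path, e.g. eng.Arial.exp0.tif
feature_cmd() {
  local linedata="$1" image="$2"
  local base="${image%.tif}"   # output base name without the extension
  local config
  if (( linedata )); then
    config="lstm.train"        # yields <base>.lstmf
  else
    config="box.train"         # yields <base>.tr
  fi
  printf 'tesseract %s %s %s\n' "$image" "$base" "$config"
}
```

For example, feature_cmd 1 eng.Arial.exp0.tif prints the lstm.train form of the command, and feature_cmd 0 prints the box.train form.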



Re: [tesseract-ocr] Train Tesseract 4.0 LSTM based on images

2017-04-12 Thread ShreeDevi Kumar
Read the bash scripts

tesstrain.sh
tesstrain_utils.sh
language-specific.sh

in the training directory to understand LSTM training in more detail.

- excuse the brevity, sent from mobile

On 12-Apr-2017 10:47 AM, "Ahmad Moawad"  wrote:

> This is the part from https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
>
> My question is about training from images, not about generating training
> data from text.
>
>
> The overall training process is similar to training 3.04.
>
> Conceptually the same:
>
> 1. Prepare training text.
> 2. Render text to image + box file. (Or create hand-made box files for
>    existing image data.)
> 3. Make unicharset file.
> 4. Optionally make dictionary data.
> 5. Run tesseract to process image + box file to make training data set.
> 6. Run training on training data set.
> 7. Combine data files.
>
> Are the above steps similar to:
>
> tesseract ara.arial.exp4.tif ara.arial.exp4 nobatch box.train
> unicharset_extractor ara.arial.exp4.box
> echo "arial 0 0 1 0 0" > font_properties # tell Tesseract about the font
> mftraining -F font_properties -U unicharset -O ara.unicharset ara.arial.exp4.tr
> shapeclustering -F font_properties -U unicharset ara.arial.exp4.tr
> cntraining ara.arial.exp4.tr
>
> mv inttemp ara.inttemp
> mv normproto ara.normproto
> mv pffmtable ara.pffmtable
> mv shapetable ara.shapetable
> combine_tessdata ara.
>
>
> Should I use these steps or not?
>
>
> The key differences are:
>
>- The boxes only need to be at the *textline level.* It is thus *far
>easier* to make training data from existing image data.
>- The .tr files are replaced by .lstmf data files.
>- Fonts *can and should be mixed freely* instead of being separate.
>- The clustering steps (mftraining, cntraining, shapeclustering) are
>replaced with a single slow lstmtraining step.
>
> I don't know a lot about this part.
>
>
> Thanks!
>


Re: [tesseract-ocr] Help in TrainingTesseract 4.00 Finetune

2017-04-12 Thread ShreeDevi Kumar
--linedata_only means that it will only try to create lstmf files and not
the files for 3.0x training.
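
To make the effect of the flag concrete, here is a minimal sketch of how a script could turn --linedata_only into a LINEDATA switch of the kind tested by "if ((LINEDATA))". The function name parse_linedata is invented for this example, and the real option parsing in tesstrain.sh may well look different.

```shell
#!/usr/bin/env bash
# Illustrative option parsing only; the real tesstrain.sh may differ.
# Sets the global LINEDATA to 1 when --linedata_only is among the arguments.
parse_linedata() {
  LINEDATA=0
  local arg
  for arg in "$@"; do
    case "$arg" in
      --linedata_only) LINEDATA=1 ;;   # only lstmf files are wanted
    esac
  done
}
```

After parse_linedata "$@", a test such as "if ((LINEDATA))" would select the lstmf path and skip the 3.0x clustering phases.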

- excuse the brevity, sent from mobile

On 12-Apr-2017 10:39 AM, "Ahmad Moawad"  wrote:

> Hello All,
>
> I want help with TrainingTesseract 4.00 Finetune
> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00---Finetune
> I want to know about some parameters, such as:
>
> 1- langdata_dir: is that the directory from https://github.com/tesseract-ocr/langdata ?
>
> training/tesstrain.sh --fonts_dir /usr/share/fonts --lang ara --linedata_only \
>   --training_text ../langdata/ara/arabic1.txt \
>   --langdata_dir ../langdata --tessdata_dir ./tessdata \
>   --fontlist "Times New Roman," \
>   --output_dir ~/tesstutorial/aratest
> 2- linedata_only: unknown
>
> Thanks.
>


Re: [tesseract-ocr] Re: Tesseract Installation

2017-04-11 Thread ShreeDevi Kumar
Also, if you want training tools, you need to build them separately - see
https://github.com/tesseract-ocr/tesseract/wiki/Compiling

make training
sudo make training-install
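
After installing, a quick way to confirm that the training tools ended up on PATH is a small check like the one below. This is only a sketch; the function name missing_tools is made up, and which tools you need depends on the training route you take.

```shell
#!/usr/bin/env bash
# Print the names of the given programs that are not found on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}
```

For example, missing_tools lstmtraining combine_tessdata unicharset_extractor prints nothing when all three are installed.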


ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, Apr 11, 2017 at 6:53 PM, shree  wrote:

>
> On Tuesday, April 11, 2017 at 4:10:26 PM UTC+5:30, Ibr wrote:
>>
>>
>> Note: I'm using windows 10 bash
>>
>
> I use it too, but via mobaxterm, which makes it easier to use
>
> see http://mobaxterm.mobatek.net/download-home-edition.html
>


Re: [tesseract-ocr] Tesseract Installation

2017-04-11 Thread ShreeDevi Kumar
You can ignore it. I get it too when using sudo a second time.

The host name must be the ID of your computer under Windows 10.

Have you tried running tesseract after that?
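
For background, sudo prints "unable to resolve host" when the machine's own host name cannot be looked up, which usually means it has no entry in /etc/hosts. The sketch below checks a hosts file for a given name. The function name hosts_has_name is invented for this example, and whether editing /etc/hosts is appropriate on your setup is an assumption you should verify.

```shell
#!/usr/bin/env bash
# Return 0 if the hosts file ($1) has a non-comment line naming host ($2).
hosts_has_name() {
  local file="$1" name="$2"
  awk -v n="$name" '
    /^[[:space:]]*#/ { next }                              # skip comment lines
    { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }   # field 1 is the address
    END { exit found ? 0 : 1 }
  ' "$file"
}
```

A possible use is: hosts_has_name /etc/hosts "$(hostname)" || echo "127.0.1.1 $(hostname)", which prints the line one might add. The 127.0.1.1 address is the Debian/Ubuntu convention and is an assumption here.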

- excuse the brevity, sent from mobile


On 11-Apr-2017 4:10 PM, "Ibr"  wrote:

Hi,

I'm trying to install Tesseract following the steps from this website.
I ran the commands for step 5. All worked fine except the command *sudo
ldconfig*, which returned the error *sudo: unable to resolve host
DESKTOP-MEO8PSD*
What is that error, and how can I solve it?
Thanks in advance.
Note: I'm using Windows 10 bash



Re: [tesseract-ocr] How to add Armenian language support to tesseract

2017-04-11 Thread ShreeDevi Kumar
I have added this at https://github.com/tesseract-ocr/langdata/issues/67

Please add more information there:


Which language code: arm or hye?

Modern Armenian or Classical Armenian?

Sources of primary texts in the Armenian language, in Unicode, to use for
training

Freely available Unicode fonts to render the text


Also read
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
and
https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951

which talk about the training process for 4.0 LSTM.


ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, Apr 11, 2017 at 1:27 AM,  wrote:

> Dear all,
>
> I have been trying Tesseract recently and it is really a very good product. I
> would like to ask if there is any tutorial or list of steps on how to add
> support for a new language to the package, for example the Armenian language.
>
> Thank you in advance.
>
>
> Regards,
>  Vahe
>


Re: [tesseract-ocr] Tesseract 4.0 doesn't see the changes after Arabic traning

2017-04-08 Thread ShreeDevi Kumar
The Arabic traineddata for 3.0x uses the cube engine. The training process for
that was never shared. The cube engine has now been removed for LSTM 4.0, which
is still in alpha stage.

There is 4.0alpha traineddata for Arabic and you can train for it, but the
accuracy is not great. Ray is doing another training run with some changes for
tatweel etc. for Arabic. Depending on the results, the changes will be made on
GitHub.

Your best bet is to wait for the next set of updates from Ray/Google and try
after that.

- excuse the brevity, sent from mobile

On 08-Apr-2017 12:09 PM, "Ahmad Moawad"  wrote:

> Hello All,
>
> I want to ask about an issue that I faced after training
> Tesseract for *Arabic*. I have Ubuntu & Tesseract-ocr 3.04.
> *Steps*:
>
> 1. $ convert ara.arial.exp1.jpg ara.arial.exp1.tif
> 2. $ tesseract ara.arial.exp1.tif ara.arial.exp1 -l ara batch.nochop makebox
> 3. edit the boxes using Qt Box Editor 2.0 beta
> 4. $ cp ara.traineddata /usr/share/tesseract-ocr/tessdata
> 5. $ tesseract ara.arial.exp1.tif out -l ara
>
> When I run the last step I get a bad result, and the Tesseract engine
> doesn't see the changes that I made in Qt Box Editor.
> Any help?
>


Re: [tesseract-ocr] (Advise needed) Command Output Fails and gives error in Tesseract 4 during fine tuning

2017-04-06 Thread ShreeDevi Kumar
You must be using an old version of traineddata which does not have LSTM.
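
One way to tell this case apart from other failures is to look for the "component is not present" message that combine_tessdata prints in the log quoted below. The sketch assumes that wording stays as quoted in this thread; the function name component_missing is made up.

```shell
#!/usr/bin/env bash
# Read combine_tessdata output on stdin; return 0 if it reports that the
# requested component is absent (message wording taken from this thread).
component_missing() {
  grep -q 'component is not present'
}
```

It could be used as: combine_tessdata -e eng.traineddata /tmp/eng.lstm 2>&1 | component_missing && echo "this traineddata has no LSTM model", where the paths are placeholders.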

- excuse the brevity, sent from mobile

On 07-Apr-2017 2:13 AM,  wrote:

> I am following this link https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00---Finetune
>
> for generating the files for fine tuning
>
>
> command used (for Reference):
>
>  combine_tessdata -e ../tessdata/ara.traineddata \
>   ~/tesstutorial/aratuned_from_ara/ara.lstm
>
>
> command used (actual):
>
>
> cmd : /home/p/Documents/T/tesseract-master/training/combine_tessdata -e /usr/share/tesseract-ocr/tessdata/eng.traineddata \
>   /home/p/Documents/T/engoutput/eng.lstm
>
> error :
>
> Extracting tessdata components from /usr/share/tesseract-ocr/tessdata/eng.traineddata
> Not extracting /home/plianto/Documents/Tvat/engoutput/eng.lstm, since this component is not present
>
>
> cmd : /home/p/Documents/T/tesseract-master/training/combine_tessdata -e /usr/share/tesseract-ocr/tessdata/eng.traineddata \
>   /home/p/Documents/T/engoutput/eng.*
>
> error:
> Extracting tessdata components from /usr/share/tesseract-ocr/tessdata/eng.traineddata
> TessdataManager can't determine which tessdata component is represented by lstmf
> tesseract::TessdataManager::TessdataTypeFromFileName( filename, , _file):Error:Assert failed:in file tessdatamanager.cpp, line 269
> Segmentation fault (core dumped)
>
> I don't know why I am not able to extract the files. Could anybody please
> give me advice?
>


Re: [tesseract-ocr] Read 2 column Image Horizontally (line by line) rather than Vertically (column by column)

2017-04-06 Thread ShreeDevi Kumar
Normally, for text output, the other config files should not impact.
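
If you want to check which config files carry the tessedit_pageseg_mode setting described in the quoted message below, a small search like this may help. It is only a sketch; the function name configs_setting_psm is invented, and the location of the configs directory depends on your installation.

```shell
#!/usr/bin/env bash
# List files in a configs directory ($1) that set tessedit_pageseg_mode.
configs_setting_psm() {
  local dir="$1"
  # -l prints only the names of files containing a matching line
  grep -l -E '^[[:space:]]*tessedit_pageseg_mode[[:space:]]' "$dir"/* 2>/dev/null
}
```

For example, configs_setting_psm /usr/share/tesseract-ocr/tessdata/configs, assuming that is where your configs live.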



- excuse the brevity, sent from mobile

On 07-Apr-2017 2:18 AM, "Mike Hall"  wrote:

> Yes, we are using the -psm 6 command line argument.  And it was not
> working.
>
> But I figured out the issue.
>
> Tesseract has a set of config files. Inside several of these config files
> (hocr, pdf, tsv, unlv) is the setting *tessedit_pageseg_mode*. This
> setting was set to 1 in all the config files.   Once I removed the
> *tessedit_pageseg_mode* parameter from the config files, our command line
> argument of -psm 6 worked.
>
> Alternatively, I did experiment with the config files.  When I changed the
> *tessedit_pageseg_mode* setting to 6 in all the config files and ran
> Tesseract with the -psm 6 command line argument, it also worked.
>
> Thanks
>
> On Thursday, April 6, 2017 at 1:12:18 PM UTC-5, shree wrote:
>
>> Have you tried --psm 6?
>>
>> - excuse the brevity, sent from mobile
>>
>> On 06-Apr-2017 11:06 PM, "Mike Hall"  wrote:
>>
>>> We have a C# .Net app that is using Tesseract to do Optical Character
>>> Recognition (OCR) on .tiff files.  I've attached a sample tiff file.
>>>
>>> We are then outputting the data to a text file.  However, Tesseract is
>>> reading the data in a vertical fashion.  In my example image, it is reading
>>> the tiff as two columns of data, and the data is being output
>>> from Tesseract like this:
>>>
>>> TYPE:
>>> DATE:
>>> Address:
>>> City:
>>> State:
>>> Owner:
>>> Owner Type:
>>> Acreage:
>>> Mortgage:
>>> 12345
>>> 2017-04-06
>>> 100 Main St.
>>> Some City
>>> Some State
>>> John Doe
>>> Primary
>>> 10.25
>>> Yes
>>>
>>> What we want is Tesseract to read the tiff file horizontally and have
>>> the output look like this:
>>>
>>> TYPE:
>>> 12345
>>> DATE:
>>> 2017-04-06
>>> Address:
>>> 100 Main St.
>>> City:
>>> Some City
>>> State:
>>> Some State
>>> Owner:
>>> John Doe
>>> Owner Type:
>>> Primary
>>> Acreage:
>>> 10.25
>>> Mortgage:
>>> Yes
>>>
>>> We've tried the various Page Segmentation options for Tesseract, but they
>>> all produce the same result.
>>> Has anyone run into this same issue? Does anybody have any ideas?
>>>


Re: [tesseract-ocr] Read 2 column Image Horizontally (line by line) rather than Vertically (column by column)

2017-04-06 Thread ShreeDevi Kumar
Have you tried --psm 6?

- excuse the brevity, sent from mobile



Re: [tesseract-ocr] Re: train tesseract OCR 4.0

2017-04-05 Thread ShreeDevi Kumar
You do not have the lstm.train config file.
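
A quick check for this is to look for the file under the tessdata directory that the log reports as TESSDATA_PREFIX. The sketch assumes the usual layout in which config files sit in a configs subdirectory of tessdata; the function name has_lstm_train_config is made up.

```shell
#!/usr/bin/env bash
# Return 0 if the tessdata directory ($1) contains configs/lstm.train.
# Assumes configs live in a "configs" subdirectory of tessdata.
has_lstm_train_config() {
  [ -f "$1/configs/lstm.train" ]
}
```

For example, has_lstm_train_config /usr/share/tesseract-ocr/tessdata || echo "lstm.train is missing". If it is missing, copying it from tessdata/configs in the source tree of the version you built is one likely fix.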

- excuse the brevity, sent from mobile

On 05-Apr-2017 1:55 PM,  wrote:

> After what you said,
>
> I tried two ways and I am stuck at the LSTM step:
>
> Training
>
> command used:
>
> /home/p/Documents/T/tesseract-master/training/lstmtraining -U /home/p/Documents/T/img_frm_3/eng.unicharset \
>   --script_dir /home/p/Documents/T/TESS_4_ALPHA/langdata-master --debug_interval 100 \
>   --net_spec '[1,36,0,1 Ct5,5,16 Mp3,3 Lfys64 Lfx128 Lrx128 Lfx256 O1c105]' \
>   --model_output /home/p/Documents/T/ \
>   --train_listfile /home/p/Documents/T/img_frm_3/eng.ArialBold.exp0.txt \
>   --eval_listfile /home/p/Documents/T/img_frm_3/eng.ArialBold.exp0.txt \
>   --max_iterations 5000 &>/home/p/Documents/T/basetrain.log
>
> tail -f basetrain.log
> The error I get is:
>
>
> Deserialize header failed: BnO. 005 SUBHISHIs TOWN CENTRE
> Deserialize header failed: MOKILA SHAKARPALLY
> Deserialize header failed: PHONE: 040-8989898989
> Load of page 0 failed!
> Load of images failed!!
> Deserialize header failed: TIN: 8989898989
> Deserialize header failed: Station 1D: 01 Time: 03:26:46 PM
> Deserialize header failed: CASHIER ID:; 3001 Date: 21-02-2017
> Deserialize header failed: (null)
> Deserialize header failed: (null)
>
>
>
>
>
>
>
>
> Fine tuning:
>
> command used:-
>
> /home/plianto/Documents/Tvat/tesseract-master/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
>   --training_text /home/plianto/Documents/Tvat/img_frm_3/eng.ArialBold.exp0.txt \
>   --langdata_dir /home/plianto/Documents/Tvat/TESS_4_ALPHA/langdata-master --tessdata_dir /usr/share/tesseract-ocr/tessdata \
>   --fontlist "Arial Bold" \
>   --output_dir /home/plianto/Documents/Tvat/engoutput/
>
> error:
>
> === Phase E: Generating lstmf files ===
> Using TESSDATA_PREFIX=/usr/share/tesseract-ocr/tessdata
> [Wed Apr 5 13:53:05 IST 2017] /usr/local/bin/tesseract /tmp/tmp.KTk3WgBTWk/eng/eng.Arial_Bold.exp0.tif /tmp/tmp.KTk3WgBTWk/eng/eng.Arial_Bold.exp0 lstm.train
> read_params_file: Can't open lstm.train
> Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
> Page 1
> ERROR: /tmp/tmp.KTk3WgBTWk/eng/eng.Arial_Bold.exp0.lstmf does not exist or is not readable
>
>
>
>
>
>
>
>
>
> On Wednesday, April 5, 2017 at 9:07:40 AM UTC+5:30, shree wrote:
>>
>> Read
>>
>> https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LSTM
>>
>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
>>
>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00---Finetune
>>
>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00---Replacing-Top-Layer-Example
>>
>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00---Replace-Top-Layer
>>
>> and
>>
>> https://github.com/tesseract-ocr/tesseract/wiki/Documentation
>>
>> https://github.com/tesseract-ocr/tesseract/wiki/Fonts
>>
>> https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage
>>
>> https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality
>>
>> https://github.com/tesseract-ocr/tesseract/wiki/FAQ
>>
>>
>>
>>
>> ShreeDevi
>> 
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Wed, Apr 5, 2017 at 12:54 AM,  wrote:
>>
>>> Can you please post some of your experiences in this thread, as there are no
>>> posts on training Tesseract 4?
>>>
>>> 1) Also, is there any way to add the newly trained data file to the old
>>> trained data file, without replacing the old file?
>>> 2) If we don't know what fonts we may get in our images, how should we
>>> proceed in training Tesseract?
>>>
>>> On Tuesday, April 4, 2017 at 9:27:06 PM UTC+5:30, Saurabh Srivastav wrote:
>>>>
>>>> Yes, I trained my Tesseract for an eng font and made it read the
>>>> characters from the image.
>>>>
>>>> thanks,
>>>> Saurabh Srivastav

Re: [tesseract-ocr] Re: train tesseract OCR 4.0

2017-04-05 Thread ShreeDevi Kumar
4.0 is alpha software. Please use an older released version.

- excuse the brevity, sent from mobile

On 05-Apr-2017 1:55 PM,  wrote:

> After u have said,
>
> I tried in two ways and i am stuck at lstm step:
>
> Training
>
> command used:
>
> /home/p/Documents/T/tesseract-master/training/lstmtraining -U
> /home/p/Documents/T/img_frm_3/eng.unicharset \
> >   --script_dir /home/p/Documents/T/TESS_4_ALPHA/langdata-master
> --debug_interval 100 \
> >   --net_spec '[1,36,0,1 Ct5,5,16 Mp3,3 Lfys64 Lfx128 Lrx128 Lfx256
> O1c105]' \
> >   --model_output /home/p/Documents/T/ \
> >   --train_listfile /home/p/Documents/T/img_frm_3/eng.ArialBold.exp0.txt
> \
> >   --eval_listfile /home/p/Documents/T/img_frm_3/eng.ArialBold.exp0.txt \
> >   --max_iterations 5000 &>/home/p/Documents/T/basetrain.log
>
> tail -f basetrain.log
> Error getting is :
>
>
> Deserialize header failed: BnO. 005 SUBHISHIs TOWN CENTRE
> Deserialize header failed: MOKILA SHAKARPALLY
> Deserialize header failed: PHONE: 040-8989898989
> Load of page 0 failed!
> Load of images failed!!
> Deserialize header failed: TIN: 8989898989
> Deserialize header failed: Station 1D: 01 Time: 03:26:46 PM
> Deserialize header failed: CASHIER ID:; 3001 Date: 21-02-2017
> Deserialize header failed: (null)
> Deserialize header failed: (null)
>
>
>
>
>
>
>
>
> Fine tuning:
>
> command used:-
>
> /home/plianto/Documents/Tvat/tesseract-master/training/tesstrain.sh
> --fonts_dir /usr/share/fonts --lang eng --linedata_only \
>   --training_text /home/plianto/Documents/Tvat/
> img_frm_3/eng.ArialBold.exp0.txt \
>   --langdata_dir /home/plianto/Documents/Tvat/TESS_4_ALPHA/langdata-master
> --tessdata_dir /usr/share/tesseract-ocr/tessdata \
>   --fontlist "Arial Bold" \
>   --output_dir /home/plianto/Documents/Tvat/engoutput/
>
> error:
>
> === Phase E: Generating lstmf files ===
> Using TESSDATA_PREFIX=/usr/share/tesseract-ocr/tessdata
> [Wed Apr 5 13:53:05 IST 2017] /usr/local/bin/tesseract
> /tmp/tmp.KTk3WgBTWk/eng/eng.Arial_Bold.exp0.tif
> /tmp/tmp.KTk3WgBTWk/eng/eng.Arial_Bold.exp0 lstm.train
> read_params_file: Can't open lstm.train
> Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
> Page 1
> ERROR: /tmp/tmp.KTk3WgBTWk/eng/eng.Arial_Bold.exp0.lstmf does not exist
> or is not readable
>
>
>
>
>
>
>
>
>
> On Wednesday, April 5, 2017 at 9:07:40 AM UTC+5:30, shree wrote:
>>
>> Read
>>
>> https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LSTM
>>
>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
>>
>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTess
>> eract-4.00---Finetune
>>
>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTess
>> eract-4.00---Replacing-Top-Layer-Example
>>
>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTess
>> eract-4.00---Replace-Top-Layer
>>
>> and
>>
>> https://github.com/tesseract-ocr/tesseract/wiki/Documentation
>>
>> https://github.com/tesseract-ocr/tesseract/wiki/Fonts
>>
>> https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage
>>
>> https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality
>>
>> https://github.com/tesseract-ocr/tesseract/wiki/FAQ
>>
>>
>>
>>
>> ShreeDevi
>> 
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Wed, Apr 5, 2017 at 12:54 AM,  wrote:
>>
>>> Can you please share your experience in this thread, as there are no posts
>>> on training Tesseract 4?
>>>
>>> 1) Also, is there any way to add the newly trained data file to the old
>>> trained data file without replacing the old file?
>>> 2) If we don't know which fonts may appear in our images, how should we
>>> proceed with training Tesseract?
>>>
>>> On Tuesday, April 4, 2017 at 9:27:06 PM UTC+5:30, Saurabh Srivastav
>>> wrote:
>>>>
>>>> Yes, I trained my Tesseract for an eng font and made it read the
>>>> characters from an image.
>>>>
>>>> thanks,
>>>> Saurabh Srivastav
>>>>

Re: [tesseract-ocr] Tesseract (4 alpha) Ambiguous Situation while Correcting Chars in box file

2017-04-05 Thread ShreeDevi Kumar
Have you tried just using the eng.traineddata directly with tess 3.04/ 3.05
/ 4.0?

You don't need to train unless it is a very special case. You can try
changing the dictionary dawg files with tess 3.0x.




ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Wed, Apr 5, 2017 at 11:25 AM,  wrote:

> I am trying to correct box files so I can train Tesseract.
>
> But I have run into a strange problem:
>
>
> 1) Tesseract recognizes some single letters as two letters; how should I
> edit the box file then? (screenshot 1)
> 2) Tesseract does not recognize some letters at all; how should I edit the
> box file then? (screenshot 2)
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduX5RSr0myJhivnXc50KzU0H5KN2Mghv6k6COkcp8%2BBELQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Re: [tesseract-ocr] Re: train tesseract OCR 4.0

2017-04-04 Thread ShreeDevi Kumar
Read

https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LSTM

https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00

https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00---Finetune

https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00---Replacing-Top-Layer-Example

https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00---Replace-Top-Layer

and

https://github.com/tesseract-ocr/tesseract/wiki/Documentation

https://github.com/tesseract-ocr/tesseract/wiki/Fonts

https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage

https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality

https://github.com/tesseract-ocr/tesseract/wiki/FAQ




ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Wed, Apr 5, 2017 at 12:54 AM,  wrote:

> Can you please share your experience in this thread, as there are no posts
> on training Tesseract 4?
>
> 1) Also, is there any way to add the newly trained data file to the old
> trained data file without replacing the old file?
> 2) If we don't know which fonts may appear in our images, how should we
> proceed with training Tesseract?
>
> On Tuesday, April 4, 2017 at 9:27:06 PM UTC+5:30, Saurabh Srivastav wrote:
>>
>> Yes, I trained my Tesseract for an eng font and made it read the
>> characters from an image.
>>
>> thanks,
>> Saurabh Srivastav
>>



Re: [tesseract-ocr] train tesseract OCR 4.0

2017-04-04 Thread ShreeDevi Kumar
tesstrain.sh generates a file called eng.training_files.txt

Your command uses the name without the .txt extension.

Check the name of the generated file and use that.

I have found that editing that file also gives errors.
- excuse the brevity, sent from mobile
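
A quick way to check this before rerunning lstmtraining is sketched below. It assumes the list file holds one .lstmf path per line, which is my understanding of what tesstrain.sh writes; `check_listfile` is a hypothetical helper name, not part of Tesseract.

```shell
#!/bin/bash
# Hypothetical helper (not a Tesseract tool): verify a training list file
# before passing it to lstmtraining via --train_listfile.
# Assumption: the list file names one .lstmf file per line.
check_listfile() {
    local listfile=$1
    local missing=0
    local lstmf
    if [[ ! -r "${listfile}" ]]; then
        echo "list file not readable: ${listfile}" >&2
        return 2
    fi
    # Walk the list and report every entry that cannot be read.
    while IFS= read -r lstmf || [[ -n "${lstmf}" ]]; do
        if [[ -z "${lstmf}" ]]; then
            continue
        fi
        if [[ ! -r "${lstmf}" ]]; then
            echo "missing: ${lstmf}" >&2
            missing=$((missing + 1))
        fi
    done < "${listfile}"
    # Succeed only when every listed file was readable.
    [[ ${missing} -eq 0 ]]
}
```

Call it with the full name of the generated file, including the .txt extension, for example `check_listfile <output_dir>/eng.training_files.txt`, where `<output_dir>` stands for whatever directory tesstrain.sh wrote to. A return status of 2 means the list file itself was not found or not readable, which would match the "Failed to load list of training filenames" message.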

On 04-Apr-2017 7:01 PM,  wrote:

> I am trying to train Tesseract 4, and I am getting the following error.
>
> command used:
>
> mkdir -p /home/p/Documents/T/engoutput
> /home/p/Documents/T/tesseract-master/training/lstmtraining \
>   -U /home/p/Documents/T/img_frm_3/unicharset \
>   --script_dir /home/p/Documents/T/TESS_4_ALPHA/langdata-master \
>   --debug_interval 100 \
>   --train_listfile /home/p/Documents/T/TESS_4_ALPHA/langdata-master/eng/eng.training_files \
>   --eval_listfile /home/p/Documents/T/TESS_4_ALPHA/langdata-master/eng/eng.training_files \
>   --max_iterations 5000 &>/home/p/Documents/T/basetrain.log
>
> used for log:
> tail -f basetrain.log
> Failed to load list of training filenames from /home/p/Documents/T/TESS_4_ALPHA/langdata-master/eng/eng.training_files
> tail: basetrain.log: file truncated
>
>
>
> error getting:
> Failed to load list of training filenames from /home/p/Documents/T/TESS_4_ALPHA/langdata-master/eng/eng.training_files
>
>
>
>
> On Tuesday, April 4, 2017 at 6:23:33 PM UTC+5:30, shree wrote:
>>
>> See
>>
>> https://github.com/tesseract-ocr/tesseract/blob/master/training/tesstrain.sh
>>
>> https://github.com/tesseract-ocr/tesseract/blob/master/training/tesstrain_utils.sh
>>
>> https://github.com/tesseract-ocr/tesseract/blob/master/training/language-specific.sh
>>



Re: [tesseract-ocr] train tesseract OCR 4.0

2017-04-04 Thread ShreeDevi Kumar
See

https://github.com/tesseract-ocr/tesseract/blob/master/training/tesstrain.sh

https://github.com/tesseract-ocr/tesseract/blob/master/training/tesstrain_utils.sh

https://github.com/tesseract-ocr/tesseract/blob/master/training/language-specific.sh



Re: [tesseract-ocr] train tesseract OCR 4.0

2017-04-03 Thread ShreeDevi Kumar
Saurabh,

It depends on what you want to do with the bash script.

Here is a sample of a script I used to compare results from different
tessdata files by looping through a set of image files. Google the bash
commands to figure out what they do!

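#!/bin/bash
# Echo each command as it is read and as it is executed.
set -vx
export TESSDATA_PREFIX=/mnt/c/Users/User/shree/tesseract-ocr

# OCR every JPEG in the current directory with three traineddata files
# and time each run, so the outputs and timings can be compared.
for img_file in *.jpeg; do
    time tesseract "${img_file}" "${img_file%.*}-ssd" -l ssd
    time tesseract "${img_file}" "${img_file%.*}-ssdsmall" --psm 6 --oem 1 -l ssdsmall
    time tesseract "${img_file}" "${img_file%.*}-eng" --psm 6 --oem 1 -l eng
done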
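
Once the loop above has run, the text outputs can be compared pairwise. The following is only a sketch: `compare_models` is a hypothetical helper name, and it assumes the outputs sit in the current directory as `<base>-<model>.txt`, which is where tesseract puts them when given the output base names used above.

```shell
#!/bin/bash
# Hypothetical helper: for each image, report whether two models produced
# identical text. Assumes files named <base>-<model>.txt in the current
# directory, as written by the loop above.
compare_models() {
    local model_a=$1 model_b=$2
    local out_a out_b base
    for out_a in *-"${model_a}".txt; do
        # Skip the literal pattern when the glob matched nothing.
        if [[ ! -e "${out_a}" ]]; then
            continue
        fi
        base=${out_a%-"${model_a}".txt}
        out_b="${base}-${model_b}.txt"
        if [[ ! -e "${out_b}" ]]; then
            echo "${base}: no ${model_b} output"
        elif cmp -s "${out_a}" "${out_b}"; then
            echo "${base}: same"
        else
            echo "${base}: differs"
        fi
    done
}
```

For example, `compare_models ssd eng` prints one line per image. For the images marked "differs", running `diff` on the two .txt files shows where the models disagree.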


ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Mon, Apr 3, 2017 at 7:10 PM, Saurabh Srivastav <
saurabhkumarsrivas...@gmail.com> wrote:

> Hello Shree! Thank you for your help.
> Could you please help me write a bash script for Tesseract?
>



Re: [tesseract-ocr] Error while creating training data for Japanese

2017-04-03 Thread ShreeDevi Kumar
jpn.config in langdata/jpn is loading jpn_vert as a sublanguage

tessedit_load_sublangs jpn_vert

You can try without that
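
One way to try that without touching the original file is to write a filtered copy. This is a minimal sketch, not something from the Tesseract scripts: `strip_sublangs` is a hypothetical helper name, and how the copy gets picked up (for example by putting it in place of jpn.config in your own langdata checkout) is left to you.

```shell
#!/bin/bash
# Hypothetical helper: write a copy of a Tesseract config file with the
# tessedit_load_sublangs line dropped, leaving the original untouched.
strip_sublangs() {
    local src=$1 dst=$2
    local rc=0
    # grep -v exits with 1 when it outputs no lines, so only a status
    # above 1 is treated as a real error (e.g. unreadable source file).
    grep -v '^[[:space:]]*tessedit_load_sublangs[[:space:]]' "${src}" > "${dst}" || rc=$?
    [[ ${rc} -le 1 ]]
}
```

For example, `strip_sublangs ../langdata/jpn/jpn.config /tmp/jpn.config` writes the filtered copy to /tmp (the destination path here is just an illustration), and `diff` on the two files confirms that only the sublanguage line was removed.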

Also look at the settings for jpn in training/language-specific.sh

You may need to change the following also ..


# The following fonts will be rendered vertically in phase I.
VERTICAL_FONTS=( \
"TakaoExGothic" \ # for jpn
"TakaoExMincho" \ # for jpn
"AR PL UKai Patched" \ # for chi_tra
"AR PL UMing Patched Light" \ # for chi_tra
"Baekmuk Batang Patched" \ # for kor
)


ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Mon, Apr 3, 2017 at 4:22 PM,  wrote:

> Hi,
>
> I'm trying to create training data for Japanese (jpn.traineddata).
>
> I ran 'tesstrain.sh' with the '--linedata_only' option, and the script
> finished (return code 0).
> But the log file contains some error messages (repeated 22 times).
>
> ```
> $ ../tesseract-ocr/training/tesstrain.sh --fonts_dir /usr/share/fonts \
>   --lang jpn --linedata_only --noextract_font_properties \
>   --langdata_dir ../langdata --tessdata_dir /usr/local/share \
>   --output_dir ~/work/jpntrain
> ```
>
>
> ---
> [Sun Apr 2 07:42:30 UTC 2017] /usr/local/bin/tesseract
> /tmp/tmp.pwcwGMb5hs/jpn/jpn.IPAPMincho.exp0.tif
> /tmp/tmp.pwcwGMb5hs/jpn/jpn.IPAPMincho.exp0 lstm.train
> ../langdata/jpn/jpn.config
> [Sun Apr 2 07:42:30 UTC 2017] /usr/local/bin/tesseract
> /tmp/tmp.pwcwGMb5hs/jpn/jpn.IPAGothic.exp0.tif
> /tmp/tmp.pwcwGMb5hs/jpn/jpn.IPAGothic.exp0 lstm.train
> ../langdata/jpn/jpn.config
> Error opening data file /usr/local/share/tessdata/jpn_vert.traineddata
> Please make sure the TESSDATA_PREFIX environment variable is set to the
> parent directory of your "tessdata" directory.
> Failed loading language 'jpn_vert'
> ---
>
> It seems that 'tesstrain.sh' requires 'jpn_vert.traineddata', but this
> file is not provided in the tessdata repository.
>
> How can I get this file? Or can I substitute 'jpn.traineddata' for
> 'jpn_vert.traineddata'?
>
>
> I've found that there is a 'jpn_vert' directory in the langdata repository,
> but it contains only some config files.
>
>
> Thanks.
>



Re: [tesseract-ocr] VietOCR 5.0 alpha availability

2017-04-03 Thread ShreeDevi Kumar
You need to get vietocr 5.0 alpha for tesseract 4.0 alpha

https://sourceforge.net/projects/vietocr/files/vietocr.net/5.0alpha/

https://sourceforge.net/projects/vietocr/files/vietocr/5.0alpha/

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Mon, Apr 3, 2017 at 2:52 PM, El Fakir Zakaria  wrote:

> Is this using Tesseract 3.04, not 4.00alpha?
>
> 2017-03-31 18:13 GMT+01:00 Quan Nguyen :
>
>> VietOCR 5.0 alpha, Java & .NET GUI frontend for Tesseract 4.00alpha, is
>> available for download. Any feedback is welcome. Thanks.
>>
>> https://sourceforge.net/projects/vietocr/files/
>>
>>


