[tesseract-ocr] Where is /path/to/eng.user-words?

2018-04-02 Thread 이경준
Hi,


I am quoting the page below.

I cannot find (lang).user-words.

Where can I find it?


Tesseract config files consist of lines with variable-value pairs (space 
separated). The variables are documented as flags in the source code like 
the following one in tesseractclass.h:

STRING_VAR_H(tessedit_char_blacklist, "", "Blacklist of chars not to 
recognize");

These variables may enable or disable various features of the engine, and 
may cause it to load (or not load) various data. For instance, let’s 
suppose you want to OCR in English, but suppress the normal dictionary and 
load an alternative word list and an alternative list of patterns — these 
two files are the most commonly used extra data files.

If your language pack is in /path/to/eng.traineddata and the hocr config is 
in /path/to/configs/hocr then create three new files:

/path/to/eng.user-words:

the
quick
brown
fox
jumped

/path/to/eng.user-patterns:

1-\d\d\d-GOOG-411
www.\n\\\*.com

/path/to/configs/bazaar:

load_system_dawg F
load_freq_dawg   F
user_words_suffix user-words
user_patterns_suffix user-patterns

Now, if you pass the word *bazaar* as a trailing command line parameter to 
Tesseract, Tesseract will not bother loading the system dictionary nor the 
dictionary of frequent words and will load and use the eng.user-words and 
eng.user-patterns files you provided. The former is a simple word list, one 
per line. The format of the latter is documented in dict/trie.h on 
read_pattern_list().
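
As a rough illustration of the setup described above, the following Python sketch writes the three files and assembles the command line. It is only a sketch under assumptions: the function name write_bazaar_files, the image name input.png and the output base name output are hypothetical, and using --tessdata-dir and -l eng to point Tesseract at the directory is an assumption that the page itself does not state. The script only builds the command; it does not run Tesseract.

```python
import tempfile
from pathlib import Path


def write_bazaar_files(tessdata_dir):
    """Create eng.user-words, eng.user-patterns and configs/bazaar.

    Returns the tesseract command line as a list (it is not executed).
    """
    root = Path(tessdata_dir)
    (root / "configs").mkdir(parents=True, exist_ok=True)

    # Word list: one word per line.
    words = ["the", "quick", "brown", "fox", "jumped"]
    (root / "eng.user-words").write_text("\n".join(words) + "\n", encoding="utf-8")

    # Pattern list, copied verbatim from the example above.
    patterns = [r"1-\d\d\d-GOOG-411", r"www.\n\\\*.com"]
    (root / "eng.user-patterns").write_text("\n".join(patterns) + "\n", encoding="utf-8")

    # Config file: one "variable value" pair per line, separated by a space.
    config = [
        "load_system_dawg F",
        "load_freq_dawg F",
        "user_words_suffix user-words",
        "user_patterns_suffix user-patterns",
    ]
    (root / "configs" / "bazaar").write_text("\n".join(config) + "\n", encoding="utf-8")

    # Hypothetical image/output names; the config name goes last.
    return ["tesseract", "input.png", "output",
            "--tessdata-dir", str(root), "-l", "eng", "bazaar"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(" ".join(write_bazaar_files(tmp)))
```

Writing the config from a list of "variable value" strings makes a missing separator between a variable and its value easy to spot.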

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/5d06132a-a726-42ea-825b-4d1f6ac5083c%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: [tesseract-ocr] Re: lstmtraining command line related

2018-03-28 Thread 이경준
Okay, sorry. I will follow the rules.

Thank you.

2018-03-29 13:40 GMT+09:00 shree :

> PLEASE DO NOT SHOUT - Sending messages in Large fontsize, RED color etc is
> not appreciated.
>
> You have used the digit 0 (zero) instead of a capital letter O in your
> network spec; it should be O1c105
>
>
> On Wednesday, March 28, 2018 at 12:24:02 PM UTC+5:30, notorio...@gmail.com
> wrote:
>>
>>
>>
>> Invalid network spec:01c105]
>> Missing ] at end of [Series]!
>> Failed to create network from spec: [1,0,0,1 Ct5,5,16 Mp3,3 Lfys64
>> Lfx128 Lrx128 Lfx256 01c105]
>> On Wednesday, March 28, 2018 at 3:53:17 PM UTC+9, notorio...@gmail.com wrote:
>>>
>>> I ran this command on my computer (Ubuntu 16.04.03 LTS):
>>>
>>> sudo lstmtraining --debug_interval -1 \
>>>   --traineddata /usr/share/tesseract-ocr/4.00/tessdata/kor.traineddata \
>>>   --net_spec '[1,0,0,1 Ct5,5,16 Mp3,3 Lfys64 Lfx128 Lrx128 Lfx256 01c105]' \
>>>   --train_listfile /usr/share/tesseract-ocr/4.00/tessdata/tesseract/training/trained_plus_chars_kor/kor.training_files.txt \
>>>   --eval_listfile /usr/share/tesseract-ocr/4.00/tessdata/tesseract/training/eval_plus_chars_kor/kor.training_files.txt \
>>>   --max_iterations 5000
>>>
>>>
>>> I get an error like this:
>>>
>>>
>>> Invalid network spec:01c105]
>>> Missing ] at end of [Series]!
>>> Failed to create network from spec: [1,0,0,1 Ct5,5,16 Mp3,3 Lfys64
>>> Lfx128 Lrx128 Lfx256 01c105]
>>>
>>>
>>> But I saw the following on the wiki page:
>>>
>>> https://github.com/tesseract-ocr/tesseract/wiki/VGSLSpecs
>>>
>>>
>>> Full Example: A 1-D LSTM capable of high quality OCR
>>>
>>> [1,1,0,48 Lbx256 O1c105]
>>>
>>> As layer descriptions: (Input layer is at the bottom, output at the top.)
>>>
>>> O1c105: Output layer produces 1-d (sequence) output, trained with CTC,
>>>   outputting 105 classes.
>>> Lbx256: Bi-directional LSTM in x with 256 outputs
>>> 1,1,0,48: Input is a batch of 1 image of height 48 pixels in greyscale,
>>>   treated as a 1-dimensional sequence of vertical pixel strips.
>>> []: The network is always expressed as a series of layers.
>>>
>>> This network works well for OCR, as long as the input image is carefully
>>> normalized in the vertical direction, with the baseline and meanline in
>>> constant places.
>>>
>>> Full Example: A multi-layer LSTM capable of high quality OCR
>>>
>>> [1,0,0,1 Ct5,5,16 Mp3,3 Lfys64 Lfx128 Lrx128 Lfx256 O1c105]
>>>
>>> As layer descriptions: (Input layer is at the bottom, output at the top.)
>>>
>>> O1c105: Output layer produces 1-d (sequence) output, trained with CTC,
>>>   outputting 105 classes.
>>> Lfx256: Forward-only LSTM in x with 256 outputs
>>> Lrx128: Reverse-only LSTM in x with 128 outputs
>>> Lfx128: Forward-only LSTM in x with 128 outputs
>>> Lfys64: Dimension-summarizing LSTM, summarizing the y-dimension with 64
>>>   outputs
>>> Mp3,3: 3 x 3 Maxpool
>>> Ct5,5,16: 5 x 5 Convolution with 16 outputs and tanh non-linearity
>>> 1,0,0,1: Input is a batch of 1 image of variable size in greyscale
>>> []: The network is always expressed as a series of layers.
>>>
>>>
>>>
>>>
>>> I typed the [ ] characters as shown there, so I have no idea why this
>>> error occurs.
>>>
>>> Could you help me?
>>>
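
The zero-versus-O mistake pointed out in the reply is easy to miss by eye. The following Python sketch shows one way to catch it before starting lstmtraining. It rests on assumptions: check_vgsl_spec is a hypothetical helper, not part of Tesseract, and it only checks the surface shape of the string (the surrounding brackets, and that every layer after the input specification starts with a letter). It does not validate the spec against the real VGSL grammar.

```python
def check_vgsl_spec(spec):
    """Return a list of human-readable problems found in a VGSL spec string."""
    problems = []
    text = spec.strip()
    if not (text.startswith("[") and text.endswith("]")):
        problems.append("spec must be wrapped in [ ]")

    tokens = text.strip("[]").split()
    # tokens[0] is the input specification (e.g. 1,0,0,1); layers follow it.
    for token in tokens[1:]:
        if token[0].isdigit():
            hint = ""
            if token[0] == "0":
                # A leading zero is most likely a mistyped capital letter O.
                hint = " (did you mean 'O" + token[1:] + "'?)"
            problems.append("layer '" + token + "' starts with a digit" + hint)
    return problems


bad = "[1,0,0,1 Ct5,5,16 Mp3,3 Lfys64 Lfx128 Lrx128 Lfx256 01c105]"
good = "[1,0,0,1 Ct5,5,16 Mp3,3 Lfys64 Lfx128 Lrx128 Lfx256 O1c105]"

for spec in (bad, good):
    print(spec, "->", check_vgsl_spec(spec) or "no problems found")
```

The check is deliberately narrow: it flags the spec from the failing command and accepts the one copied from the wiki, but a spec that passes it can still be rejected by lstmtraining for other reasons.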


[tesseract-ocr] What is (lang).xheights and how can I make (lang).xheights?

2018-03-25 Thread 이경준
Hi. I'm using tesstrain.sh (for rendering) with my Korean fonts and Korean
training_text, and in a folder under /tmp I found kor.xheights.

What is kor.xheights?

Does this file affect rendering?

Can I modify this file before rendering?

Please let me know.



[tesseract-ocr] Which training method is suitable for a specific domain with Tesseract 4.00-beta.1?

2018-03-23 Thread 이경준
For a specific domain, which training method is suitable for Tesseract
4.00-beta.1?

I can fine-tune with images, or I can fine-tune with text.

However, I want to use Tesseract for business, so I think Tesseract's
accuracy threshold needs to be above 0.95 to 0.975.

My situation is specific, though: I only have to recognize the images I
already have. I do not have to recognize images that I do not have.

Do you see what I mean?

For example, if I have 1000 images, I only want to recognize those images
with Tesseract 4.00-beta.1.

What is your opinion?

I think that training a few layers is better than fine-tuning.

Am I right?



[tesseract-ocr] How to fine-tune Tesseract 4.00 Beta.1 using existing images rather than text?

2018-03-22 Thread 이경준


Hi,

In Tesseract 4.00 Beta.1, I know how to fine-tune with text, but I don't
know how to fine-tune with the existing images I have.

Please answer my question.



[tesseract-ocr] What is the flag option "--pass_through_recoder" in the combine_lang_model

2018-03-22 Thread 이경준
Hi 

I have a question 

USAGE: combine_lang_model
  --lang_is_rtl  True if lang being processed is written right-to-left  (type:bool default:false)
  --pass_through_recoder  If true, the recoder is a simple pass-through of the unicharset. Otherwise, potentially a compression of it  (type:bool default:false)
  --input_unicharset  Unicharset to complete and use in encoding  (type:string default:)
  --script_dir  Directory name for input script unicharsets  (type:string default:)
  --words  File listing words to use for the system dictionary  (type:string default:)
  --puncs  File listing punctuation patterns  (type:string default:)
  --numbers  File listing number patterns  (type:string default:)
  --output_dir  Root directory for output files  (type:string default:)
  --version_str  Version string to add to traineddata file  (type:string default:)
  --lang  Name of language being processed  (type:string default:)


What does the "--pass_through_recoder" flag in combine_lang_model do?

I don't understand it.

Could someone answer, please?



[tesseract-ocr] What is the Tesseract 4.00 release plan?

2018-03-20 Thread 이경준
What is the release plan for Tesseract 4.00?
Could I get an answer?

@theraysmith

Thank you.



[tesseract-ocr] How to use my existing images to fine-tune Tesseract 4.00-beta.1

2018-03-19 Thread 이경준
How can I use existing images (that I own) to fine-tune Tesseract
4.00-beta.1?

I know how to fine-tune the Tesseract 4.00 beta with text, by adding some
text lines (about 100 to 120) based on (lang).training_text from the
langdata directory.

But I have no idea how to fine-tune Tesseract 4.00-beta.1 with images.

Could you give me an answer, please?



[tesseract-ocr] Where can I get the langdata for 4.00? At present only the 3.04 files are uploaded.

2018-03-17 Thread 이경준
Where can I get the langdata for 4.00? At present, only the files for 3.04
are uploaded.


https://github.com/tesseract-ocr/langdata


That repository contains data for Tesseract 3.04.


Please help.

Ray Smith, would you give me information about the Korean langdata for 4.00?

Thank you.



Re: [tesseract-ocr] What is different between Tesseract 4.00 (alpha) and Tesseract 4.00 (beta) in detail?

2018-03-16 Thread 이경준
Oh, thank you.



[tesseract-ocr] What is different between Tesseract 4.00 (alpha) and Tesseract 4.00 (beta) in detail?

2018-03-16 Thread 이경준
Hi,

What are the detailed differences between Tesseract 4.00 (alpha) and
Tesseract 4.00 (beta)?

Thank you.



[tesseract-ocr] Is it possible to repeatedly train data that has already been fine-tuned once?

2018-03-16 Thread 이경준
Is it possible to train data that has already been fine-tuned once (1번, one
time) over and over again?

When I try, I get an error like "Segmentation fault (core dumped)".

I would like a solution.

Bye



[tesseract-ocr] What is kor.numbers?

2018-03-16 Thread 이경준




I saw the "kor.numbers" file, but it contains no characters.


What is it?

Thank you.



Re: [tesseract-ocr] Depending on OS, tesseract (4.0) performance is different?

2018-03-15 Thread 이경준


<https://lh3.googleusercontent.com/-tw2KAlGrnno/WqsIOVBJTEI/M2c/tzskGa8GfUsmxOcZRR_Vst47vGyBptqRACLcBGAs/s1600/2.png>

<https://lh3.googleusercontent.com/-7AcT-f0s75k/WqsILjIDOPI/M2Y/oi1OqK8PTd4rRN6p_aBr0VtbXPnjEIbBACLcBGAs/s1600/1.png>

I think a PPA for Tesseract 4.00 Beta doesn't exist yet.

Am I wrong?


On Friday, March 16, 2018 at 1:09:11 AM UTC+9, shree wrote:
>
> No.
>
> You can use Alex's PPA and install for your version of Ubuntu.
>
>
>
> On Thu 15 Mar, 2018, 9:16 PM 이경준, <player...@gmail.com > 
> wrote:
>
>> I am now installing Ubuntu 18.04 for Tesseract 4.00 beta.1.
>>
>> Is that the right approach?
>>


Re: [tesseract-ocr] Depending on OS, tesseract (4.0) performance is different?

2018-03-15 Thread 이경준
I am now installing Ubuntu 18.04 for Tesseract 4.00 beta.1.

Is that the right approach?



Re: [tesseract-ocr] Depending on OS, tesseract (4.0) performance is different?

2018-03-15 Thread 이경준
Thank you so much.



Re: [tesseract-ocr] Depending on OS, tesseract (4.0) performance is different?

2018-03-15 Thread 이경준
The way to completely remove Tesseract 4.0 (alpha) is:

$ sudo apt-get remove tesseract-ocr

$ sudo apt autoremove

Is that right?



Re: [tesseract-ocr] Depending on OS, tesseract (4.0) performance is different?

2018-03-15 Thread 이경준
Is it possible to install Tesseract 4.0 beta over the existing version, like
installing a patch file?
Sorry.



Re: [tesseract-ocr] Depending on OS, tesseract (4.0) performance is different?

2018-03-15 Thread 이경준
Hi,
First, set up the PPA.
Second, run sudo apt-get remove tesseract-ocr.
Third, run sudo apt-get install tesseract-ocr.



Re: [tesseract-ocr] Depending on OS, tesseract (4.0) performance is different?

2018-03-15 Thread 이경준

The attached file is my training_text.

For fine-tuning, I use these fonts:

Fonts for training: Baekmuk Dotum, Baekmuk Gulim, Baekmuk Headline

Font for evaluation: Baekmuk Batang

You can install the Baekmuk fonts with:

$ apt-get install fonts-baekmuk

I don't know which font is giving me the error, so I have described my
training environment and settings.



kor.plus.training_text
Description: Binary data


Re: [tesseract-ocr] Depending on OS, tesseract (4.0) performance is different?

2018-03-15 Thread 이경준
Thank you so much.

1) How do I replace Tesseract 4.00 alpha with Tesseract 4.00 beta?

Thank you.

On Thursday, March 15, 2018 at 4:56:59 PM UTC+9, shree wrote:
>
> >  tesseract 4.0 Alpha on Ubuntu 16.04.03 LTS 
>
> Please use latest version beta.1 or build from source on github.
>
> > They are operated by Windows . I Think. 
>
> No, they are not operated by Windows. They run on 'bash under Windows',
> which provides Ubuntu 14.04. It can use fonts installed under Windows.
>
> > Depending on OS, tesseract (4.0) performance is different?  
>
> Quite possible. It will also depend on how many changes from github are 
> included in each.
>
> > I finally Do not solve can't encode transcription , after replacing top layer
>
> I cannot reproduce the problem. Please send your training_text and font 
> that is giving you error so that I can check with it.
>
> bash script will not run directly on windows.
>
>
>
>
>
> ShreeDevi
> ________
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>


[tesseract-ocr] Re: Depending on OS, tesseract (4.0) performance is different?

2018-03-15 Thread 이경준
Also, could you give me some advice on running Tesseract 4.0 on Windows,
such as using a bash script?

Thank you.



[tesseract-ocr] Depending on OS, tesseract (4.0) performance is different?

2018-03-15 Thread 이경준
Hi Shree, I'm using Tesseract 4.0 alpha on Ubuntu 16.04.03 LTS.

But you gave me 2 scripts for Tesseract 4.0.

I think the scripts are meant to run on Windows.

So I changed my running environment for Tesseract 4.0 (Ubuntu -> Windows 10).

Does Tesseract 4.0 performance differ depending on the OS?

Also, I still have not solved the "can't encode transcription" error, even
after replacing the top layer.

I conclude that I have to change my OS (Ubuntu -> Windows 10).

Thank you.

Can you check my training_text for fine-tuning?



[tesseract-ocr] lstmtraining : can't transciption encode error (korean) (tesseract 4.0)

2018-03-13 Thread 이경준



lstmtraining reports a "can't encode transcription" error (Korean).

I'm fine-tuning with Tesseract 4.0 when this error occurs.

As a first fix, I tried (1) replacing the top layer,

but the error doesn't disappear.

What is the next step for solving this error?

Could you tell me the steps?
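
A reply quoted later in this thread suggests the error appears when characters in the training text are not part of the unicharset the model was trained with. A minimal sketch of a check for that follows. It assumes a unicharset file has already been extracted from the traineddata, that its first line is an entry count and each later line starts with one character followed by a space, and that the shell runs in a UTF-8 locale so that grep splits the text into characters rather than bytes. The function name missing_chars and the file names are hypothetical, and the result is only approximate for scripts whose unicharset entries span more than one code point.

```shell
# Print every non-space character of a training text that has no entry
# in a unicharset file.
#   $1 = unicharset file (assumed layout: count line, then "<char> <props...>")
#   $2 = training text
missing_chars() {
  known=$(mktemp)
  # Skip the count on line 1 and keep only the first column (the character).
  tail -n +2 "$1" | cut -d' ' -f1 > "$known"
  # One character per line, de-duplicated, minus everything already known.
  grep -o '[^[:space:]]' "$2" | sort -u | grep -vxF -f "$known"
  rm -f "$known"
}
```

Called as missing_chars kor.lstm-unicharset kor.training_text (both file names are placeholders), an empty result would suggest that the character set is not the cause.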




Re: [tesseract-ocr] message from runnig tesseract from my tuned traineddata(korean)

2018-03-13 Thread 이경준
Thank you. ㅜㅜ Please help me.
After replacing the top layer, I still get the "can't encode" error message 
(the one from the GitHub issue). ㅜㅜ



[tesseract-ocr] message from runnig tesseract from my tuned traineddata(korean)

2018-03-13 Thread 이경준





Hi. I get some messages when running Tesseract with my fine-tuned 
traineddata (Korean).

1) Whenever I run Tesseract, it always prints a message like "Warning. 
Invalid resolution 0 dpi. Using N instead", where N is a constant (e.g. 30, 70, 80, 120, ...).

Is that OK?

2) I'm using my fine-tuned Korean traineddata, but Tesseract always prints a 
message like "Error opening data file /chi_tra.traineddata" and tells me to 
check the TESSDATA_PREFIX environment variable.

Is that OK?

Shree, you told me to refer to kor.config,

so I looked at kor.config. It contains this line:

tessedit_load_sublangs chi_tra


So do I have to download chi_tra.traineddata into my tessdata folder? And when I 
train the Korean traineddata, do I have to train chi_tra.traineddata as well?
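
The second message can be narrowed down by comparing the sub-languages named in the config with the files actually present in the tessdata folder. The sketch below is an illustration under assumptions: it assumes kor.config has been unpacked into a plain file (for example with combine_tessdata -u), that the value of tessedit_load_sublangs is a list of language codes separated by "+", and that each language lives in a file named <code>.traineddata. The function name missing_sublangs is hypothetical.

```shell
# List sub-languages requested by a config file that have no matching
# <code>.traineddata in a tessdata directory.
#   $1 = config file (e.g. an unpacked kor.config)
#   $2 = tessdata directory
missing_sublangs() {
  awk '$1 == "tessedit_load_sublangs" { for (i = 2; i <= NF; i++) print $i }' "$1" |
    tr '+' '\n' |
    while read -r lang; do
      [ -n "$lang" ] || continue
      [ -f "$2/$lang.traineddata" ] || echo "$lang"
    done
}
```

If this prints chi_tra, placing chi_tra.traineddata next to the Korean file would be the first thing to try. Whether the sub-language also needs training is a separate question that this check does not answer.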



Re: [tesseract-ocr] Re: pango library doesn't recognize my font .

2018-03-13 Thread 이경준


<https://lh3.googleusercontent.com/-HnvFroiEXnM/Wqf9U-M2P3I/M1Y/yAqjN8Rpfm0pugO3FfiCV0GPmTcj-NBIQCLcBGAs/s1600/fonts.png>


Thank you. I deleted the last line, as you told me.

Now I can see lots of Korean fonts, which I could not before.

Can all of them be used for training?

But I get an error along these lines:

argument fonts ("specific_font") are not assigned



Re: [tesseract-ocr] Re: pango library doesn't recognize my font .

2018-03-13 Thread 이경준
Yes. ㅜㅜ

I also saw this GitHub issue:
https://github.com/tesseract-ocr/tesseract/issues/688

On Wednesday, March 14, 2018 at 1:07:11 AM UTC+9, shree wrote:
>
> Did you use the fonts_dir where the fonts are installed?
>



Re: [tesseract-ocr] Re: pango library doesn't recognize my font .

2018-03-13 Thread 이경준
Thank you. I now have a fontslist file.

But when I open it with vim fontslist.txt,

there are no fonts in it.

Does that mean I cannot use Korean fonts?

On Tuesday, March 13, 2018 at 9:09:45 PM UTC+9, shree wrote:
>
> Give the following command, after changing directories to match your setup:
>
> text2image --find_fonts \
> --fonts_dir /usr/share/fonts \
> --text ../langdata/kor/kor.training_text \
> --min_coverage .9 \
> --render_per_font false \
> --outputbase ../langdata/kor/kor \
> |& grep raw | sed -e 's/ :.*/" \\/g' | sed -e 's/^/  "/' \
> >../langdata/kor/fontslist.txt
>
> and then check the selected fonts in 
> ../langdata/kor/fontslist.txt 
>
>
>
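
To make the filter at the end of that command easier to follow: in bash, |& is shorthand for 2>&1 |, so the text2image log on stderr is piped as well; grep raw keeps only lines containing "raw"; the first sed replaces everything from " :" onward with a closing quote and a backslash; the second sed prefixes two spaces and an opening quote. The sketch below wraps the same three steps in a function that reads from stdin. The function name format_fontlist and the sample input line are hypothetical, and real text2image output may look different.

```shell
# Same filter as in the quoted command, reading log lines from stdin.
format_fontlist() {
  grep raw | sed -e 's/ :.*/" \\/g' | sed -e 's/^/  "/'
}

# Hypothetical sample lines; only the first contains "raw" and survives.
printf '%s\n' 'Baekmuk Batang : 95 hits, raw = 90' 'some other log line' |
  format_fontlist
```

Note that the redirection >../langdata/kor/fontslist.txt has to be part of the same command as the pipeline. Entered as a command on its own line, it only creates an empty file, which would match the symptom described above.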



Re: [tesseract-ocr] Re: pango library doesn't recognize my font .

2018-03-13 Thread 이경준
Thank you. I have lots of Korean fonts, but only the Baekmuk fonts work.

I would really like to know why the Pango library doesn't recognize the others.

On Tuesday, March 13, 2018 at 7:48:44 PM UTC+9, shree wrote:
>
> Remove these two lines and try again:
>
>    --fonts_dir $fonts_dir \
>    --fontlist $fonts_for_training \
>
>
> These override what is given in language-specific.sh.
>
> ShreeDevi
> 
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>



[tesseract-ocr] Re: pango library doesn't recognize my font .

2018-03-13 Thread 이경준


#!/bin/bash
# (C) Copyright 2014, Google Inc.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# This script provides an easy way to execute various phases of training
# Tesseract.  For a detailed description of the phases, see
# https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract
#
# USAGE:
#
# tesstrain.sh
#    --fontlist FONTS           # A list of fontnames to train on.
#    --fonts_dir FONTS_PATH     # Path to font files.
#    --lang LANG_CODE           # ISO 639 code.
#    --langdata_dir DATADIR     # Path to tesseract/training/langdata directory.
#    --output_dir OUTPUTDIR     # Location of output traineddata file.
#    --overwrite                # Safe to overwrite files in output_dir.
#    --linedata_only            # Only generate training data for lstmtraining.
#    --run_shape_clustering     # Run shape clustering (use for Indic langs).
#    --exposures EXPOSURES      # A list of exposure levels to use (e.g. "-1 0 1").
#
# OPTIONAL flags for input data. If unspecified we will look for them in
# the langdata_dir directory.
#    --training_text TEXTFILE   # Text to render and use for training.
#    --wordlist WORDFILE        # Word list for the language ordered by
#                               # decreasing frequency.
#
# OPTIONAL flag to specify location of existing traineddata files, required
# during feature extraction. If unspecified will use TESSDATA_PREFIX defined in
# the current environment.
#    --tessdata_dir TESSDATADIR # Path to tesseract/tessdata directory.
#
# NOTE:
# The font names specified in --fontlist need to be recognizable by Pango using
# fontconfig. An easy way to list the canonical names of all fonts available on
# your system is to run text2image with --list_available_fonts and the
# appropriate --fonts_dir path.


source "$(dirname $0)/tesstrain_utils.sh"

ARGV=("$@")
parse_flags

mkdir -p ${TRAINING_DIR}
tlog "\n=== Starting training for language '${LANG_CODE}'"

source "$(dirname $0)/language-specific1.sh"
set_lang_specific_parameters ${LANG_CODE}

initialize_fontconfig

phase_I_generate_image 8
phase_UP_generate_unicharset
if ((LINEDATA)); then
  phase_E_extract_features "lstm.train" 8 "lstmf"
  make__lstmdata
else
  phase_D_generate_dawg
  phase_E_extract_features "box.train" 8 "tr"
  phase_C_cluster_prototypes "${TRAINING_DIR}/${LANG_CODE}.normproto"
  if [[ "${ENABLE_SHAPE_CLUSTERING}" == "y" ]]; then
    phase_S_cluster_shapes
  fi
  phase_M_cluster_microfeatures
  phase_

[tesseract-ocr] pango library doesn't recognize my font .

2018-03-13 Thread 이경준
Hi, my name is June. Hi Shree, I have a question. I'm using the bash script you 
gave me.


In the script:


# the EVAL handles the quotes in the font list
eval $tesstrain_dir/tesstrain.sh \
   --lang $Lang \
   --linedata_only\
   --noextract_font_properties \
   --exposures "0" \
   --fonts_dir $fonts_dir \
   --fontlist $fonts_for_training \
   --langdata_dir $langdata_dir \
   --training_text $langdata_dir/$Lang/$Lang.$plusTraining_text \
   --tessdata_dir $bestdata_dir \
   --output_dir $train_output_dir
 
P.S. All the variables are assigned (e.g. fonts_for_training="Baekmuk 
Batang").


When I run the script above, I get an error and it doesn't work.


So I had to delete "--fontlist $fonts_for_training" and make a pair of 
modified scripts, tesstrain1.sh and language-specific1.sh (for the training fonts).

In that case it does work.


I checked my system (Ubuntu 16.04.3 LTS) with $ fc-list 

for Korean fonts.

I have lots of Korean fonts installed,

but they don't work.

Why doesn't the Pango library recognize the fonts I installed?
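
One possible cause, offered as an assumption and not as a diagnosis: because $fonts_for_training is unquoted, a value such as "Baekmuk Batang" is split at the space before eval sees it, so the script may receive two font names, Baekmuk and Batang, instead of one. The script comment about EVAL handling the quotes suggests the variable is expected to carry its own quote characters. The sketch below shows the difference with a stand-in function; count_args and both variable names are hypothetical and only stand in for tesstrain.sh and its arguments.

```shell
# Stand-in for tesstrain.sh: report how many arguments arrive.
count_args() { echo "$#"; }

fonts_plain="Baekmuk Batang"       # no quote characters inside the value
fonts_quoted="'Baekmuk Batang'"    # quote characters are part of the value

eval count_args $fonts_plain       # prints 2: split into Baekmuk and Batang
eval count_args $fonts_quoted      # prints 1: eval re-parses the inner quotes
```

If that is what happens here, assigning the variable with embedded quotes, as in the second form, would be worth trying before modifying the scripts.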




Re: [tesseract-ocr] How to replace top LSTM top layer ?

2018-03-13 Thread 이경준
Thank you.

On Tuesday, March 13, 2018 at 5:07:11 PM UTC+9, shree wrote:
>
> That command applies to an older version of the source code.
>
> Now you need a starter traineddata.
>
> Please see the wiki page at 
>
> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#training-just-a-few-layers
>
> ShreeDevi
> 
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
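
As a rough illustration of what the reply describes, the sketch below prints (it does not run) an lstmtraining command in which the -U and --script_dir options of the older example are dropped and a starter traineddata is passed instead. It assumes that --traineddata is the flag that takes the starter traineddata, as on the wiki page linked in the reply; the other flags are copied from the older command quoted in this thread. The function name replace_top_layer_cmd and every path are hypothetical placeholders, and the net_spec value has to be adapted to the language being trained.

```shell
# Print an lstmtraining command line for "replace top layer" training.
#   $1 = language code, $2 = working directory,
#   $3 = starter traineddata, $4 = net_spec for the new top layer
replace_top_layer_cmd() {
  lang="$1"; workdir="$2"; starter="$3"; net_spec="$4"
  # Every line except the last ends with " \" so the output can be pasted.
  printf '%s \\\n' \
    "lstmtraining --debug_interval 0" \
    "  --continue_from $workdir/$lang.lstm" \
    "  --traineddata $starter" \
    "  --append_index 5 --net_spec '$net_spec'" \
    "  --model_output $workdir/${lang}layer" \
    "  --train_listfile $workdir/$lang.training_files.txt"
  printf '%s\n' "  --target_error_rate 0.01"
}

replace_top_layer_cmd kor ~/tesstutorial/korlayer \
  ~/tesstutorial/kor/kor.traineddata '[Lfx256 O1c105]'
```

Printing the command first makes it easy to review the paths before starting a long training run.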

Re: [tesseract-ocr] How to replace top LSTM top layer ?

2018-03-13 Thread 이경준


https://github.com/tesseract-ocr/tesseract/issues/549



@harinath141 <https://github.com/harinath141> If you are getting a lot of 
these errors during finetune, try replace top layer training. You can use 
the box/tiff pairs generated for finetune. Commands will be similar to the 
following:

mkdir -p ~/tesstutorial/tellayer_from_tel 

combine_tessdata -e ../tessdata/tel.traineddata \
  ~/tesstutorial/tellayer_from_tel/tel.lstm
  
lstmtraining -U ~/tesstutorial/tel/tel.unicharset \
  --script_dir ../langdata  --debug_interval 0 \
  --continue_from ~/tesstutorial/tellayer_from_tel/tel.lstm \
  --append_index 5 --net_spec '[Lfx256 O1c105]' \
  --model_output ~/tesstutorial/tellayer_from_tel/tellayer \
  --train_listfile ~/tesstutorial/tel/tel.training_files.txt \
  --target_error_rate 0.01



I found this comment that you wrote.


But --script_dir doesn't work in lstmtraining?


How should I change this option (flag)? What replaces it?


On Tuesday, March 13, 2018 at 4:24:52 PM UTC+9, shree wrote:
>
> That info is given in the training wiki page.
>
> On Tue 13 Mar, 2018, 12:53 PM 이경준, <player...@gmail.com > 
> wrote:
>
>> There is nothing there about how to replace the top layer ... ㅜㅜ 
>>
>> On Tuesday, March 13, 2018 at 4:22:08 PM UTC+9, shree wrote:
>>>
>>> https://github.com/tesseract-ocr/tesseract/issues/1009
>>>
>>> Link works ok
>>>
>>> On Tue 13 Mar, 2018, 12:37 PM 이경준, <player...@gmail.com> wrote:
>>>
>>>> Shreeshrii <https://github.com/Shreeshrii> commented on 29 Jun 2017 
>>>> <https://github.com/tesseract-ocr/tesseract/issues/1012#issuecomment-311892286>
>>>>  • 
>>>> edited 
>>>>
>>>> I think this happens when the complex characters in your training text 
>>>> are not part of the original Korean Unicharset that the 4.00.00alpha 
>>>> kor.traineddata was trained with.
>>>>
>>>> Do 'replace top layer' training instead of finetune. @abhishekchopde 
>>>> <https://github.com/abhishekchopde> has had good results with it - see 
>>>> #1009 <https://github.com/tesseract-ocr/tesseract/issues/1009>
>>>>
>>>> It will take longer than finetuning.
>>>>
>>>>
>>>>
>>>> Hi Shree, I have a question. You posted this comment, but the link in it 
>>>> does not work. Please check it again.
>>>>
>



Re: [tesseract-ocr] Training tesseract 4.0 with large training text

2018-03-13 Thread 이경준
Thank you.

On Tuesday, March 13, 2018 at 4:21:23 PM UTC+9, shree wrote:
>
> You have to look in the file called by it
>
>
> tesstrain_utils.sh
>
> On Tue 13 Mar, 2018, 12:22 PM 이경준, <player...@gmail.com > 
> wrote:
>
>> Hi Shree. I looked at the tesstrain.sh file.
>>
>> But I cannot find where max-pages is set to 3. Where is it?
>>
>> Could you explain it in more detail?
>>
>> On Tuesday, March 13, 2018 at 10:57:29 AM UTC+9, shree wrote:
>>>
>>> Please look at tesstrain.sh
>>>
>>> It is setting max-pages to 3 for text2image invocation. You can change 
>>> it there.
>>>
>>> On Tue 13 Mar, 2018, 6:54 AM , <john.d...@gmail.com> wrote:
>>>
>>>> Dear all,
>>>>
>>>> I'm trying to train lstm using a large training text, different fonts, 
>>>> colors etc. I'm trying to use text2image to generate my tif / box file 
>>>> combinations, however text2image appears to be limited to 3 pages and thus 
>>>> truncates my training text. How should I solve this? Call text2image in a 
>>>> loop on the remaining training text and generate hundreds, if not 
>>>> thousands, of tif / box file combos for all of my training text, fonts etc?
>>>>
>>>> Thanks for the help!
>>>>
>>>> John.
>>>>
>>>> -- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to tesseract-oc...@googlegroups.com.
>>>> To post to this group, send email to tesser...@googlegroups.com.
>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>> To view this discussion on the web visit 
>>>> https://groups.google.com/d/msgid/tesseract-ocr/10bc983a-83a5-4434-afca-18cc2d5d1ce4%40googlegroups.com
>>>>  
>>>> <https://groups.google.com/d/msgid/tesseract-ocr/10bc983a-83a5-4434-afca-18cc2d5d1ce4%40googlegroups.com?utm_medium=email_source=footer>
>>>> .
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>
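
A rough sketch of the advice in this thread: locate the page limit with grep, then change it in your copy of the script. The spelling `--max_pages=N` and the helper names `show_max_pages` and `set_max_pages` are assumptions made for illustration; check what grep actually prints in your checkout before editing anything.

```shell
#!/bin/sh
# Show every line of file $1 that mentions max_pages, with line numbers.
show_max_pages() {
    grep -n 'max_pages' "$1"
}

# Rewrite every "--max_pages=<number>" in file $1 to use the value $2.
# Hypothetical helper: goes through a temp file so it works with any POSIX sed,
# and writes back with cat so the original file permissions are kept.
set_max_pages() {
    file=$1
    pages=$2
    tmp="${file}.tmp.$$"
    sed "s/--max_pages=[0-9][0-9]*/--max_pages=${pages}/g" "$file" > "$tmp" &&
        cat "$tmp" > "$file" &&
        rm -f "$tmp"
}

# Example (not run here; the path is the file named in this thread):
#   show_max_pages training/tesstrain_utils.sh
#   set_max_pages  training/tesstrain_utils.sh 50
```

Keeping a backup copy of the script before editing makes it easy to return to the shipped value.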



Re: [tesseract-ocr] How to replace the top LSTM layer?

2018-03-13 Thread 이경준
I cannot find any explanation there of how to replace the top layer ... ㅜㅜ 

On Tuesday, March 13, 2018 at 4:22:08 PM UTC+9, shree wrote:
>
> https://github.com/tesseract-ocr/tesseract/issues/1009
>
> Link works ok
>
> On Tue 13 Mar, 2018, 12:37 PM 이경준, <player...@gmail.com > 
> wrote:
>
>> Shreeshrii <https://github.com/Shreeshrii> commented on 29 Jun 2017 
>> <https://github.com/tesseract-ocr/tesseract/issues/1012#issuecomment-311892286>
>> (edited):
>>
>> I think this happens when the complex characters in your training text 
>> are not part of the original Korean Unicharset that the 4.00.00alpha 
>> kor.traineddata was trained with.
>>
>> Do 'replace top layer' training instead of finetune. @abhishekchopde 
>> <https://github.com/abhishekchopde> has had good results with it - see 
>> #1009 <https://github.com/tesseract-ocr/tesseract/issues/1009>
>>
>> It will take longer than finetuning.
>>
>>
>>
>> Hi Shree, I have a question. You posted the comment above, but the link 
>> in it does not work. Please check it again.
>>
>



[tesseract-ocr] How to replace the top LSTM layer?

2018-03-13 Thread 이경준
Shreeshrii commented on 29 Jun 2017 (edited):

I think this happens when the complex characters in your training text are 
not part of the original Korean Unicharset that the 4.00.00alpha 
kor.traineddata was trained with.

Do 'replace top layer' training instead of finetune. @abhishekchopde 
has had good results with it - see #1009

It will take longer than finetuning.



Hi Shree, I have a question. You posted the comment above, but the link in 
it does not work. Please check it again.



Re: [tesseract-ocr] Training tesseract 4.0 with large training text

2018-03-13 Thread 이경준
Hi Shree. I looked at the tesstrain.sh file.

But I cannot find where max-pages is set to 3. Where is it?

Could you explain it in more detail?

On Tuesday, March 13, 2018 at 10:57:29 AM UTC+9, shree wrote:
>
> Please look at tesstrain.sh
>
> It is setting max-pages to 3 for text2image invocation. You can change it 
> there.
>
> On Tue 13 Mar, 2018, 6:54 AM ,  wrote:
>
>> Dear all,
>>
>> I'm trying to train lstm using a large training text, different fonts, 
>> colors etc. I'm trying to use text2image to generate my tif / box file 
>> combinations, however text2image appears to be limited to 3 pages and thus 
>> truncates my training text. How should I solve this? Call text2image in a 
>> loop on the remaining training text and generate hundreds, if not 
>> thousands, of tif / box file combos for all of my training text, fonts etc?
>>
>> Thanks for the help!
>>
>> John.
>>
>



Re: [tesseract-ocr] Re: I do not include 'chi_tra' in my tessdata folder . What is it ? I have seen language-specific.sh

2018-03-11 Thread 이경준
Thank you for replying to my questions.



Re: [tesseract-ocr] I do not include 'chi_tra' in my tessdata folder . What is it ? I have seen language-specific.sh

2018-03-10 Thread 이경준
Sorry ... I just want to understand Tesseract 4.0. Sorry.



[tesseract-ocr] I do not include 'chi_tra' in my tessdata folder . What is it ? I have seen language-specific.sh

2018-03-09 Thread 이경준
Hi, I'm sorry for asking so often and for asking so many questions.

But I must use Tesseract 4.0 for my business.

Please understand my situation. I have a large family to support.


Earlier you gave me *a bash script*. When it runs *tesstrain.sh*, it gives 
me an error like this:


Please make sure the TESSDATA_PREFIX environment variable is set to your 
"tessdata" directory.
*Failed loading language 'chi_tra'*
Error opening data file 
/usr/share/tesseract-ocr/4.00/tessdata/chi_tra.traineddata


*Before that, you pointed me to kor.config in the langdata directory.*

*It contains:*

#Fixes https://github.com/tesseract-ocr/tesseract/issues/1009
preserve_interword_spaces 1
tessedit_load_sublangs chi_tra
# New Segmentation search params


So I guess that "tessedit_load_sublangs chi_tra" causes the error when 
"tesstrain.sh" is executed.

So my proposed solution is: *1) Delete that line. Is that right? What are 
the side effects?*


I want to have one fine-tuned traineddata for two languages 
(Korean and English),

so is it possible to add a line like this? *1-1)*

*"tessedit_load_sublangs eng" Is that right? Is it possible?*

*In conclusion:*

*1) I do not want to see the error "Please make sure the TESSDATA_PREFIX 
environment variable is set to your "tessdata" directory. Failed loading 
language 'chi_tra' Error opening data file 
/usr/share/tesseract-ocr/4.00/tessdata/chi_tra.traineddata".*

*2) I want to use Tesseract 4.0 for two languages (e.g. Korean and English) 
with one fine-tuned traineddata.*

Is that possible, and how do I make one fine-tuned traineddata for two 
languages (e.g. Korean and English)?

3) Can tesseract be used like this?

$ tesseract (picture.png) -l kor+eng

4) What is the kor.vert traineddata (in tessdata-best)?

How is it different from kor.traineddata?

5) Is it possible to fine-tune with existing images? How can I do that with 
the script you gave me?
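
One way to test the guess in 1) is to unpack the traineddata, drop the `tessedit_load_sublangs` line from the extracted config, and repack it. The sketch below only edits the config text. The helper name `strip_sublangs` and the file names are illustrative, and the `combine_tessdata` and `tesseract` calls are left as comments because they need a local Tesseract install and have not been run here. Whether removing the line affects recognition of Chinese characters in Korean text is not something this sketch can answer.

```shell
#!/bin/sh
# Print config file $1 without any tessedit_load_sublangs line.
strip_sublangs() {
    sed '/^[[:space:]]*tessedit_load_sublangs[[:space:]]/d' "$1"
}

# Possible workflow (assumes kor.traineddata is in the current directory):
#   combine_tessdata -u kor.traineddata ./kor.       # unpack the components
#   strip_sublangs kor.config > kor.config.new
#   mv kor.config.new kor.config
#   combine_tessdata -o kor.traineddata kor.config   # overwrite the config component
#
# Recognition with two installed languages also needs an output base name:
#   tesseract picture.png out -l kor+eng
```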



Re: [tesseract-ocr] tesseract (4.0) criterion

2018-03-09 Thread 이경준
Thank you. But I think my question was misunderstood.

I already know the method you explained.

What I meant is this: the tessconfigs folder and the configs folder are 
normally created when Tesseract 4.0 is installed.

But I did not have them, so I asked you the question.

I formatted my computer and installed Tesseract 4.0 again.

This problem is now solved.

I think my Ubuntu installation was messed up. ㅜㅜ

On Friday, March 9, 2018 at 6:10:51 PM UTC+9, shree wrote:
>
> From the wiki, home page
>
> Various types of training data can be found on GitHub 
> <https://github.com/tesseract-ocr/>. Unpack and copy the .traineddata 
> file into a 'tessdata' directory. The exact directory will depend both on 
> the type of training data, and your Linux distribution. Possibilities are 
> /usr/share/tesseract-ocr/tessdata or /usr/share/tessdata or 
> /usr/share/tesseract-ocr/4.00/tessdata.
>
>
> On Fri 9 Mar, 2018, 2:16 PM 이경준, <player...@gmail.com > 
> wrote:
>
>> Hi. I have a question.
>>
>> After installing Tesseract 4.0, 
>>
>> the tessdata folder does not contain the tessconfigs and configs folders. 
>>
>> ㅜㅜ 
>>
>> My situation: my computer runs Ubuntu 16.04, and I installed 
>> Tesseract 4.0 from a package. ㅜㅜ
>>
>> What is the problem? This is what I did:
>>
>> 1. Added the PPA
>> 2. Installed the tesseract-ocr package
>>
>> Please help me. ㅜㅜㅜ
>>
>
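
The candidate directories from the quoted wiki text can be checked with a short loop. This is only a sketch: the helper name `find_tessdata_dir` is made up for illustration, and the list holds just the three paths mentioned above, so a different distribution or a source build may use another location.

```shell
#!/bin/sh
# Print the first directory among the arguments that holds *.traineddata files.
find_tessdata_dir() {
    for dir in "$@"; do
        # ls fails quietly when the glob matches nothing.
        if ls "$dir"/*.traineddata >/dev/null 2>&1; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# Candidate locations quoted from the wiki text above.
find_tessdata_dir \
    /usr/share/tesseract-ocr/tessdata \
    /usr/share/tessdata \
    /usr/share/tesseract-ocr/4.00/tessdata ||
    echo "no tessdata directory found in the usual places"
```

Once a directory is found, listing it shows whether the configs and tessconfigs subfolders are present.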



[tesseract-ocr] tesseract (4.0) criterion

2018-03-09 Thread 이경준
Hi. I have a question.

After installing Tesseract 4.0, 

the tessdata folder does not contain the tessconfigs and configs folders. 

ㅜㅜ 

My situation: my computer runs Ubuntu 16.04, and I installed 
Tesseract 4.0 from a package. ㅜㅜ

What is the problem? This is what I did:

1. Added the PPA
2. Installed the tesseract-ocr package

Please help me. ㅜㅜㅜ



[tesseract-ocr] Re: reinstall problem - tesseract 4

2018-03-08 Thread 이경준


<https://lh3.googleusercontent.com/-HSeP4TkMdIg/WqITKqNZBGI/M00/pOUzXuGyGZcHs1HNs_2vORk-oQvMU1lyQCLcBGAs/s1600/tessdata.png>



On Friday, March 9, 2018 at 1:43:27 PM UTC+9, 이경준 wrote:
>
> I removed Tesseract (4.00)
>
> and reinstalled my computer (Ubuntu),
>
> but I cannot find the *configs folder* or the *tessconfigs folder* in tessdata ... 
>
> I have never seen this before .. ㅜㅜㅜ 
>
> Please help me ... 
>



[tesseract-ocr] reinstall problem - tesseract 4

2018-03-08 Thread 이경준
I removed Tesseract (4.00)

and reinstalled my computer (Ubuntu),

but I cannot find the *configs folder* or the *tessconfigs folder* in tessdata ... 

I have never seen this before .. ㅜㅜㅜ 

Please help me ... 



Re: [tesseract-ocr] Re: @shree / Finally I made the customized (fine tuned) traineddata

2018-03-08 Thread 이경준
(In addition) Do you mean that during fine-tuning, when the tuned traineddata 
is made, kor.config affects the process?

I would really appreciate your answer .. ㅜㅜ 

*Do I have to rebuild my existing (fine-tuned) traineddata from scratch?*

Thank you.

On Friday, March 9, 2018 at 8:54:23 AM UTC+9, 이경준 wrote:
>
> Sorry, I deleted my files.
>
> On Thursday, March 8, 2018 at 6:03:13 PM UTC+9, marco atzeri wrote:
>>
>> On 08/03/2018 09:58, 이경준 wrote: 
>> > This is my finely tuned traineddata (3types) 
>> > 
>> > my os environment is Ubuntu 16.04.03 LTS 
>>
>> Please avoid putting large attachments on any mailing list. 
>>
>> If you need to share, upload it somewhere else and share the link 
>>
>



Re: [tesseract-ocr] Re: @shree / Finally I made the customized (fine tuned) traineddata

2018-03-08 Thread 이경준
Sorry, I deleted my files.

On Thursday, March 8, 2018 at 6:03:13 PM UTC+9, marco atzeri wrote:
>
> On 08/03/2018 09:58, 이경준 wrote: 
> > This is my finely tuned traineddata (3types) 
> > 
> > my os environment is Ubuntu 16.04.03 LTS 
>
> Please avoid putting large attachments on any mailing list. 
>
> If you need to share, upload it somewhere else and share the link 
>



Re: [tesseract-ocr] @shree / Finally I made the customized (fine tuned) traineddata

2018-03-08 Thread 이경준
Thank you for answering my question yesterday. I was sick, sorry. ㅜㅜ

Shree, okay, I checked kor.config. 

1) Do you mean that I can delete the chi_tra sub-language option? 

2) I don't understand exactly what "The langdata files are from 3.04" 
means. 

I'm currently using Tesseract 4.0. ㅜㅜ

Thank you.

On Thursday, March 8, 2018 at 6:06:23 PM UTC+9, shree wrote:
>
> Please look at the kor.config file in langdata. Maybe it is loading chi_tra
>
> The langdata files are from 3.04
>
> On Thu 8 Mar, 2018, 2:27 PM 이경준, <player...@gmail.com > 
> wrote:
>
>> Hi 
>>
>> I finally made the customized (fine-tuned) traineddata for Korean.
>>
>>
>> But when I run tesseract, 
>>
>> I have a problem. 
>>
>> *Please make sure the TESSDATA_PREFIX environment variable is set to your 
>> "tessdata" directory.*
>>
>> *Failed loading language 'chi_tra'*
>>
>> I don't know why.
>>
>> There is no chi_tra.traineddata in my tessdata directory .. ㅜㅜㅜ
>>
>> Did I fail to make the traineddata? 
>>
>> Is my traineddata wrong? 
>>
>>
>> I would really appreciate your reply ...
>>
>



[tesseract-ocr] @shree / Finally I made the customized (fine tuned) traineddata

2018-03-08 Thread 이경준
Hi 

I finally made the customized (fine-tuned) traineddata for Korean.


But when I run tesseract, 

I have a problem. 

*Please make sure the TESSDATA_PREFIX environment variable is set to your 
"tessdata" directory.*

*Failed loading language 'chi_tra'*

I don't know why.

There is no chi_tra.traineddata in my tessdata directory .. ㅜㅜㅜ

Did I fail to make the traineddata? 

Is my traineddata wrong? 


I would really appreciate your reply ...



Re: [tesseract-ocr] @Shree //// I have a question about making a traineddata which is fine-tuned.

2018-03-04 Thread 이경준
"Plud" was a typo for "plus" (additionally).



[tesseract-ocr] What is difference between unicharset and lstm-unicharset ?

2018-03-01 Thread 이경준

Hi. Thank you for reading my questions.

1. What is the difference between 'unicharset' and 'lstm-unicharset'?

I know how to make 'unicharset' with the command line: "$ tesseract 
(lang).(filename).exp(num).tif (lang).(filename).exp(num).box"

But I don't know how to make 'lstm-unicharset'.

cf) .tr -> .lstmf

I changed this command line: "$ tesseract (lang).(filename).exp(num).tif 
(lang).(filename).exp(num) nobatch *box.train*" to "$ tesseract 
(lang).(filename).exp(num).tif (lang).(filename).exp(num) nobatch 
*lstm.train*"

2. Is this usage right?

Is it possible to use 'unicharset' as 'lstm-unicharset'?
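
The two command lines in question differ only in the trailing config name. The sketch below prints them side by side without running Tesseract. The helper name `train_cmd` and the sample file name are made up for illustration, and `nobatch` is left out on the assumption that the 4.0 training scripts do not pass it.

```shell
#!/bin/sh
# Print (do not run) the feature-extraction command for image $1.
# $2 is the training config: box.train (.tr output) or lstm.train (.lstmf output).
train_cmd() {
    img=$1
    config=$2
    base=${img%.tif}          # kor.myfont.exp0.tif -> kor.myfont.exp0
    printf 'tesseract %s %s %s\n' "$img" "$base" "$config"
}

train_cmd kor.myfont.exp0.tif box.train
train_cmd kor.myfont.exp0.tif lstm.train
```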


I look forward to everyone's answers.

Thank you. Have a nice day!



Re: [tesseract-ocr] Re: I'm reading Using tesstrain (tesseract 4.0) wiki passage _ I have a question

2018-03-01 Thread 이경준
Oh, I see. ㅜㅜㅜ Thank you. I was really impressed by you.

OK. Thank you very much.

Last question ... I do not understand the types of trained data.

Are you saying that, in Tesseract 4.0, tessdata_best is better than 
tessdata? ㅜㅜㅜ 

And what is tessdata_fast? ㅜㅜ "Fast integer versions of trained 
models"?

ㅜㅜ Sorry, please help me ...

On Thursday, March 1, 2018 at 10:10:18 PM UTC+9, shree wrote:
>
> >  I would to make a  customized and trainned "New trainneddata" 
>
> OK. But training from scratch takes a lot of time. I assume that you want 
> to finetune.
>
> Please note that the traineddata files in tessdata and tessdata_best and 
> tessdata_fast are NOT compatible. So, it depends on what version of 
> tesseract program you are using.
>
> I have already  sent you the bash script that you can modify for 
> training.  
>
> ShreeDevi
> 
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Thu, Mar 1, 2018 at 6:36 PM, ShreeDevi Kumar  > wrote:
>
>> > combine_tessdata -u kor.traineddata What is that meaning ? Could you 
>> explain for me ? 
>>
>> That command will show and unpack the components of your traineddata 
>> file. 
>>
>> e.g. from tessdata_fast
>>
>> combine_tessdata -u ./tessdata_fast/kor.traineddata ./tessdata_fast/kor.
>> Extracting tessdata components from ./tessdata_fast/kor.traineddata
>> Wrote ./tessdata_fast/kor.config
>> Wrote ./tessdata_fast/kor.lstm
>> Wrote ./tessdata_fast/kor.lstm-punc-dawg
>> Wrote ./tessdata_fast/kor.lstm-word-dawg
>> Wrote ./tessdata_fast/kor.lstm-number-dawg
>> Wrote ./tessdata_fast/kor.lstm-unicharset
>> Wrote ./tessdata_fast/kor.lstm-recoder
>> Wrote ./tessdata_fast/kor.version
>> Version 
>> string:4.00.00alpha:kor:synth20170629:[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx384O1c1]
>> 0:config:size=90, offset=192
>> 17:lstm:size=973837, offset=282
>> 18:lstm-punc-dawg:size=2602, offset=974119
>> 19:lstm-word-dawg:size=605274, offset=976721
>> 20:lstm-number-dawg:size=74, offset=1581995
>> 21:lstm-unicharset:size=76228, offset=1582069
>> 22:lstm-recoder:size=19034, offset=1658297
>> 23:version:size=80, offset=1677331
>>
>>
>
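
The component listing printed by `combine_tessdata -u` can be reduced to name and size pairs with sed. This is a sketch that assumes the `index:name:size=N, offset=N` line shape shown in the quoted output; the helper name `list_components` is made up for illustration.

```shell
#!/bin/sh
# Read combine_tessdata -u output on stdin and print "name size" for each
# component line of the form "<index>:<name>:size=<bytes>, offset=<bytes>".
list_components() {
    sed -n 's/^[0-9][0-9]*:\([^:]*\):size=\([0-9][0-9]*\),.*/\1 \2/p'
}

# Example with three lines copied from the listing above.
list_components <<'EOF'
0:config:size=90, offset=192
17:lstm:size=973837, offset=282
21:lstm-unicharset:size=76228, offset=1582069
EOF
```

With a real file this would be used as `combine_tessdata -u kor.traineddata ./kor. 2>&1 | list_components`; the `2>&1` is there in case the tool writes its listing to stderr.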



[tesseract-ocr] Re: I'm reading Using tesstrain (tesseract 4.0) wiki passage _ I have a question

2018-03-01 Thread 이경준
And an additional question:

combine_tessdata -u kor.traineddata 

What is "-u"? What does it mean?

I cannot find that option (flag) on the wiki or the GitHub page.

Could you give me an explanation?

On Wednesday, February 28, 2018 at 4:21:17 PM UTC+9, 이경준 wrote:
>
> Hi, I'm studying this passage, but I cannot understand what the flag 
> "--noextract_font_properties" means. So I looked at the file 
> /tesseract/training/tesstrain.sh  
>
> But I cannot find "--noextract_font_properties" there.
>
> Here is the usage: 
>
> # USAGE:
> #
> # tesstrain.sh
> #    --fontlist FONTS          # A list of fontnames to train on.
> #    --fonts_dir FONTS_PATH    # Path to font files.
> #    --lang LANG_CODE          # ISO 639 code.
> #    --langdata_dir DATADIR    # Path to tesseract/training/langdata directory.
> #    --output_dir OUTPUTDIR    # Location of output traineddata file.
> #    --overwrite               # Safe to overwrite files in output_dir.
> #    --linedata_only           # Only generate training data for lstmtraining.
> #    --run_shape_clustering    # Run shape clustering (use for Indic langs).
> #    --exposures EXPOSURES     # A list of exposure levels to use (e.g. "-1 0 1").
> #
> # OPTIONAL flags for input data. If unspecified we will look for them in
> # the langdata_dir directory.
> #    --training_text TEXTFILE  # Text to render and use for training.
> #    --wordlist WORDFILE       # Word list for the language ordered by
> #                              # decreasing frequency.
> #
> # OPTIONAL flag to specify location of existing traineddata files, required
> # during feature extraction. If unspecified will use TESSDATA_PREFIX defined in
> # the current environment.
> #    --tessdata_dir TESSDATADIR  # Path to tesseract/tessdata directory.
> #
> # NOTE:
> # The font names specified in --fontlist need to be recognizable by Pango using
> # fontconfig. An easy way to list the canonical names of all fonts available on
> # your system is to run text2image with --list_available_fonts and the
> # appropriate --fonts_dir path.
>
>
>
>
>
>
> Using tesstrain
>
> The setup for running tesstrain.sh 
> <https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-%E2%80%93-tesstrain.sh> 
> is the same as for base Tesseract. Use the --linedata_only option for LSTM 
> training. Note that it is beneficial to have more training text and make 
> more pages though, as neural nets don't generalize as well and need to 
> train on something similar to what they will be running on. If the target 
> domain is severely limited, then all the dire warnings about needing a lot 
> of training data may not apply, but the network specification may need to 
> be changed.
>
> Training data is created using tesstrain.sh 
> <https://github.com/tesseract-ocr/tesseract/blob/master/training/tesstrain.sh> 
> as follows. Note that your fonts location may vary.
>
> training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
>   --noextract_font_properties --langdata_dir ../langdata \
>   --tessdata_dir ./tessdata --output_dir ~/tesstutorial/engtrain
>
>
>
> Thank you very much. I look forward to everyone's replies.
>
>
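
A flag that is missing from the usage header of tesstrain.sh may still be parsed by a helper script that it sources, so searching the whole training directory is a quick check. This is a sketch: the helper name `find_flag` is made up for illustration, and no claim is made here about which file the search will match.

```shell
#!/bin/sh
# Search every file under directory $2 for the literal flag name $1
# and print the matches as file:line:text.
find_flag() {
    grep -rnF -- "$1" "$2"
}

# Example (not run here; assumes the current directory is the tesseract source tree):
#   find_flag noextract_font_properties training/
```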



Re: [tesseract-ocr] Re: I'm reading Using tesstrain (tesseract 4.0) wiki passage _ I have a question

2018-03-01 Thread 이경준
No. I'm really Sorry about complaining about tesseract(4.0) 

I mean that tesseract is great , but is not perfect(100%) 

I think that Tesseract is fairy good.

But, I have a clue about customizing and using Tesseract(4.0) rightfully

Thank U.

At first I know the trainneddata is 3 types

tessdata / tessdata_fast/ tessdata_best /

so, I used tessdata - kor.trainneddata to check tesseract(4.0) test ..

and you give me a right answer about using tesseract .(another questions I 
wrote to you)

<https://lh3.googleusercontent.com/-LJk-KTYwO8M/Wpf00BT2O5I/Mw0/3VfqA5FPzd8Ued9xBWcF1VrvNKm9NKz5wCLcBGAs/s1600/0301_2.png>


But you suggest using tessdata_fast? I do not understand why.

Some time ago I tried all three sets (tessdata_best, tessdata_fast and 
tessdata), one after another,

and the most accurate results came from tessdata.


<https://lh3.googleusercontent.com/-_PSt4nXFW2U/Wpf1mHFHAiI/MxA/zPfyMZG-j14kqwb85eOR0ALF3TwfvSm2gCLcBGAs/s1600/0301_1.png>


The GitHub page description says that tessdata_best has the best accuracy of 
the three sets.

But in practice that was not the case for me. The recognition rate differs 
noticeably between tessdata and tessdata_best.

So I use tessdata. What could cause that result?

In your most recent answer in this thread, you suggested typing the 
following on the command line:

combine_tessdata -u kor.traineddata

What does that command mean? Could you explain it to me?
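
A note for readers with the same question: as far as I know, combine_tessdata works on the components packed inside a .traineddata file. The -d option lists them, and -u unpacks them into separate files whose names start with a given prefix. The sketch below only prints the commands (a dry run) unless DRY_RUN=0 is set. The output prefix ./kor_unpacked/kor. is an assumption chosen for illustration.

```shell
#!/bin/sh
# Dry-run helper: print the command by default, execute it only when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  else
    echo "+ $*"
  fi
}

# inspect_traineddata FILE PREFIX
#   FILE   : a traineddata file, e.g. kor.traineddata
#   PREFIX : hypothetical output prefix for the unpacked components
inspect_traineddata() {
  run mkdir -p "$(dirname "$2")"      # make sure the output directory exists
  run combine_tessdata -d "$1"        # list the components in the file
  run combine_tessdata -u "$1" "$2"   # unpack them as PREFIX<component>
}

inspect_traineddata kor.traineddata ./kor_unpacked/kor.
```

Running it with DRY_RUN=0 needs the Tesseract tools on the PATH and an existing kor.traineddata in the current directory.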

In short:

1. I want to use Tesseract 4.0 properly in my business,

so I plan to use traineddata.

Case #1:
If the three sets of traineddata already published on GitHub (tessdata / 
tessdata_best / tessdata_fast) are not good enough for Korean in my 
business,

I would like to make a new, custom-trained traineddata.

But how can I make that decision? Is there a threshold I can use to help 
me decide?


Please help me.


On Thursday, March 1, 2018 at 4:17:41 PM UTC+9, shree wrote:
>
> > We don't understand what each other is saying.
>
> Sorry about that.
>
> Please give the following commands and let me know the result.
>
> tesseract -v
>
> tesseract --list-langs
>
> combine_tessdata -u kor.traineddata
>
> I do not know Korean, but feedback from other users has been that 
> tesseract4 and the latest traineddata give good results.
>
>
> > Does your last command line mean installing the language pack 
> > (kor.traineddata) in the tessdata directory?
>
> > I want to say that I already use it that way, but the recognition rate 
> > on my test image is not good enough for business use.
>
>
> There are three sets of traineddata files, in tessdata, tessdata_best and 
> tessdata_fast repositories on github. 
>
> I was suggesting that you use the ones from tessdata_fast, which are 
> packaged by AlexanderP for older versions of Linux along with the latest 
> version of the programs.
>
> I am attaching a test image and the results that I get using Tesseract. To 
> me the accuracy looks good, whether that is acceptable for use in business 
> is a decision you have to make.
>
>
>
>
>
> ShreeDevi
> 
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Thu, Mar 1, 2018 at 11:30 AM, 이경준 <player...@gmail.com> wrote:
>
>> Thank you for the advice.
>>
>> I have never installed Tesseract 3.0x.
>>
>> I have a question.
>>
>> Does your last command line mean installing the language pack 
>> (kor.traineddata) in the tessdata directory?
>>
>> Am I wrong?
>>
>> I want to say that I already use it that way, but the recognition rate 
>> on my test image is not good enough for business use.
>>
>> We don't understand what each other is saying.
>>
>> Thank you.
>>
>> On Thursday, March 1, 2018 at 2:30:11 PM UTC+9, shree wrote:
>>
>>> > My system runs Ubuntu 16.04.3 LTS.
>>>
>>> > Yes, I tried kor.traineddata from tessdata, but it is not good 
>>> > enough. Sorry, I cannot use the Tesseract 4.0 tessdata 
>>> > kor.traineddata in business.
>>>
>>> I suggest that you uninstall your old Tesseract version (3.0x):
>>>
>>>
>>> sudo apt-get remove tesseract-ocr
>>>
>>>
>>> and then install the Tesseract 4.00 version from the PPA provided by 
>>> AlexanderP:
>>>
>>> sudo add-apt-repository ppa:alex-p/tesseract-ocr
>>> sudo apt-get update
>>>
>>> sudo apt-get install tesseract-ocr
>>>
>>> sudo apt-get install tesseract-ocr-kor
>>>
>>>
>>>

[tesseract-ocr] Re: bash script to help finetune training for Korean

2018-03-01 Thread 이경준
Thank you, I really appreciate your kindness.

Thank you.

On Thursday, March 1, 2018 at 4:37:51 PM UTC+9, shree wrote:
>
> The log file sent earlier was only for training steps. 
>
> Complete log file which shows output on console during building of 
> training data using tesstrain.sh is attached now.
>
> ShreeDevi
> 
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Thu, Mar 1, 2018 at 12:58 PM, ShreeDevi Kumar wrote:
>
>> I am attaching a bash script which makes it easy to give all the 
>> commands required for fine-tune training of a language. I recently tested 
>> it in response to the queries about Korean. Hopefully this will be easier 
>> to follow than descriptive text.
>>
>> While the commands are currently set up for Korean, they can easily be 
>> changed for other languages.
>>
>> The directories etc. need to be set based on your local setup.
>>
>> Please note: if using Indic languages and RTL languages, the 
>> combine_lang_model command will need additional variables to be set:
>>
>>   --lang_is_rtl  True if lang being processed is written right-to-left  
>> (type:bool default:false)
>>   --pass_through_recoder  If true, the recoder is a simple pass-through 
>> of the unicharset. Otherwise, potentially a compression of it  (type:bool 
>> default:false)
>>
>>
>

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/c9b7a469-b5e7-454f-8ecf-e6f63c303535%40googlegroups.com.


Re: [tesseract-ocr] Re: I'm reading Using tesstrain (tesseract 4.0) wiki passage _ I have a question

2018-02-28 Thread 이경준
Thank you for the advice.

I have never installed Tesseract 3.0x.

I have a question.

Does your last command line mean installing the language pack 
(kor.traineddata) in the tessdata directory?

Am I wrong?

I want to say that I already use it that way, but the recognition rate on 
my test image is not good enough for business use.

We don't understand what each other is saying.

Thank you.

On Thursday, March 1, 2018 at 2:30:11 PM UTC+9, shree wrote:
>
> > My system runs Ubuntu 16.04.3 LTS.
>
> > Yes, I tried kor.traineddata from tessdata, but it is not good enough. 
> > Sorry, I cannot use the Tesseract 4.0 tessdata kor.traineddata in 
> > business.
>
> I suggest that you uninstall your old Tesseract version (3.0x):
>
>
> sudo apt-get remove tesseract-ocr
>
>
> and then install the Tesseract 4.00 version from the PPA provided by 
> AlexanderP:
>
> sudo add-apt-repository ppa:alex-p/tesseract-ocr
> sudo apt-get update
>
> sudo apt-get install tesseract-ocr
>
> sudo apt-get install tesseract-ocr-kor
>
>
>
>

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/f1de33d0-e0c4-4d65-88b4-57c92562ea8a%40googlegroups.com.


Re: [tesseract-ocr] Re: I'm reading Using tesstrain (tesseract 4.0) wiki passage _ I have a question

2018-02-28 Thread 이경준
Yes, I tried kor.traineddata from tessdata, but it is not good enough. 
Sorry, I cannot use the Tesseract 4.0 tessdata kor.traineddata in 
business.

So I must train 4.00 for Korean. Thank you for the advice.

On Thursday, March 1, 2018 at 12:59:31 PM UTC+9, shree wrote:
>
>
> On Thu, Mar 1, 2018 at 9:21 AM, 이경준 <player...@gmail.com> wrote:
>
>> Thank you for replying to my question.
>>
>> But my system runs Ubuntu 16.04.3 LTS.
>>
>> I think that path will not work there. Am I wrong?
>>
>>
>> On Wednesday, February 28, 2018 at 6:18:41 PM UTC+9, shree wrote:
>>>
>>> Try the following. Make sure that you change all the dir variables to 
>>> match your setup.
>>>
>>> tesstrain.sh \
>>>  --lang kor \
>>>  --noextract_font_properties \
>>>  --linedata_only \
>>>  --langdata_dir ../langdata \
>>>  --tessdata_dir ../tessdata \
>>>  --fonts_dir /mnt/c/Windows/Fonts \
>>>  --fontlist \
>>>   "Arial Unicode MS" \
>>>  --output_dir ../tesstutorial/kor
>>>
>>> The fontlist you specify in command will override the list in 
>>> language_specific.sh
>>>
>>>
>>>
> Tesseract 4.00alpha gives good results for Korean recognition. Have you 
> tried that? You may not need to do training.
>
> If you want to do training for 4.00, you need files from langdata and 
> tessdata_best.
>
> https://github.com/tesseract-ocr/langdata
> https://github.com/tesseract-ocr/tessdata_best
>
> see https://github.com/tesseract-ocr/langdata/blob/master/README.md
>
>
>
>

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/88825700-8da1-4fc7-be6e-1bccdf0848d5%40googlegroups.com.


[tesseract-ocr] I have a question about font_properties(tesseract 4.0)

2018-02-28 Thread 이경준
Hi, I have a question about font_properties (Tesseract 4.0).


https://github.com/tesseract-ocr/langdata/blob/master/font_properties


For example:
Baekmuk_Dotum 0 0 0 0 0

Do the five digits correspond to the fields described on the wiki page 
cited below? Is that right?

The description I am quoting is from 
https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-3.00%E2%80%933.02

Thank you for answering my question. I will wait for a reply. Thank you.
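
For reference, the Tesseract 3.0x training wiki describes each font_properties line as a font name (without spaces) followed by five 0/1 flags: italic, bold, fixed, serif, fraktur. The sketch below simply labels the fields of one line. The function name is made up for illustration, and whether these flags matter for LSTM training is not something this sketch settles.

```shell
#!/bin/sh
# describe_font_properties LINE
# Labels the six whitespace-separated fields of one font_properties line,
# using the field order given in the Tesseract 3.0x training wiki.
describe_font_properties() {
  # Re-split the single argument into positional parameters.
  set -- $1
  if [ "$#" -ne 6 ]; then
    echo "expected 6 fields, got $#" >&2
    return 1
  fi
  printf 'font=%s italic=%s bold=%s fixed=%s serif=%s fraktur=%s\n' "$@"
}

describe_font_properties "Baekmuk_Dotum 0 0 0 0 0"
```

Read this way, a line of all zeros describes a font that is not italic, not bold, not fixed-pitch, not serif and not fraktur.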




-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/3e451c9e-0db8-461e-a154-bd0b8b9be2bf%40googlegroups.com.


[tesseract-ocr] I have a question about making a traineddata (tesseract 4.0 LSTM)

2018-02-28 Thread 이경준
Hi,

I have a question about making a traineddata file (Tesseract 4.0 LSTM).

Tutorial Guide to lstmtraining
Creating Starter Traineddata

NOTE: This is a new step!

Instead of a unicharset and script_dir, lstmtraining now takes a traineddata 
file on its command-line, to obtain all the information it needs on the 
language to be learned. The traineddata *must* contain at least an 
lstm-unicharset and lstm-recoder component, and may also contain the three 
dawg files: lstm-punc-dawg, lstm-word-dawg, lstm-number-dawg. A config file 
is also optional. The other components, if present, will be ignored and 
unused.

There is no tool to create the lstm-recoder directly. Instead there is a 
new tool, combine_lang_model, which takes as input an input_unicharset and 
script_dir (script_dir points to the langdata directory) and optional word 
list files. It creates the lstm-recoder from the input_unicharset and 
creates all the dawgs, if wordlists are provided, putting everything 
together into a traineddata file.
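
As a rough illustration of the passage above, the sketch below prints (but does not run) a combine_lang_model command line. The paths and the helper name are placeholders, not values from the wiki. The optional word-list flags are left out, and the lstm-unicharset is not created by hand: as I read the passage, it comes from the input_unicharset given to the tool.

```shell
#!/bin/sh
# starter_traineddata_cmd LANG UNICHARSET LANGDATA OUTDIR
# Prints a combine_lang_model command for building a starter traineddata.
# All four values are placeholders; adjust them to the local setup.
starter_traineddata_cmd() {
  lang="$1"         # language code, e.g. kor
  unicharset="$2"   # input unicharset file
  langdata="$3"     # checkout of the langdata repository (script_dir)
  outdir="$4"       # directory that receives the new traineddata
  echo "combine_lang_model --input_unicharset $unicharset --script_dir $langdata --output_dir $outdir --lang $lang"
}

starter_traineddata_cmd kor ./kor.unicharset ../langdata ./koroutput
```

The printed line can be copied into a terminal once the paths point at real files.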




In the passage above, I could not find how to make an 'lstm-unicharset', so 
I have no idea how to proceed.


I also have a question about 
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00 


NOTE Tesseract 4.00 will now run happily with a traineddata file that 
contains *just* lang.lstm, lang.lstm-unicharset and lang.lstm-recoder. The 
lstm-*-dawgs are optional, and *none of the other components are required 
or used with OEM_LSTM_ONLY as the OCR engine mode.* No bigrams, unichar 
ambigs or any of the other components are needed or even have any effect if 
present. The only other component that does anything is the lang.config, 
which can affect layout analysis, and sub-languages.

If added to an existing Tesseract traineddata file, the lstm-unicharset doesn't 
have to match the Tesseract unicharset, but the same unicharset must be 
used to train the LSTM and build the lstm-*-dawgs files.




The end of this wiki page says that a traineddata file must contain 
'lang.lstm, lang.lstm-unicharset, lang.lstm-recoder' (mandatory).



But the earlier 'Creating Starter Traineddata' passage says that a 
traineddata file must contain 'lstm-recoder, lstm-unicharset' (mandatory).



Which statement is right?
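
One way to read the two passages as consistent: the starter traineddata is what training begins from, so it does not yet hold a trained network (lang.lstm), while the file used for recognition needs all three components. The sketch below checks an unpacked traineddata against either list. The function name and the two mode names are made up for illustration.

```shell
#!/bin/sh
# check_lstm_components PREFIX MODE
#   PREFIX : path prefix of the unpacked components, e.g. ./kor_unpacked/kor.
#   MODE   : "starter" (before training) or "final" (used for recognition)
check_lstm_components() {
  prefix="$1"
  mode="$2"
  required="lstm-unicharset lstm-recoder"
  if [ "$mode" = "final" ]; then
    required="lstm $required"   # the trained network is needed only here
  fi
  missing=""
  for component in $required; do
    [ -f "${prefix}${component}" ] || missing="$missing $component"
  done
  if [ -n "$missing" ]; then
    echo "missing:$missing"
    return 1
  fi
  echo "ok"
}
```

For example, `check_lstm_components ./kor_unpacked/kor. final` reports which of the three files are absent under that prefix.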


Please help me.



-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/b1de73d9-8cfd-4f70-bcb9-f4dfccb79a9b%40googlegroups.com.


Re: [tesseract-ocr] Re: I'm reading Using tesstrain (tesseract 4.0) wiki passage _ I have a question

2018-02-28 Thread 이경준
Thank you for replying to my question.

But my system runs Ubuntu 16.04.3 LTS.

I think that path will not work there. Am I wrong?
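
A note on the path: /mnt/c/Windows/Fonts looks like a Windows font directory as seen from the Windows Subsystem for Linux, so it would normally not exist on a native Ubuntu install. The sketch below prints the same kind of command with the font directory and font name passed in. /usr/share/fonts is the location used in the wiki example, and the font name NanumGothic is only an assumption that must be replaced by a font actually installed.

```shell
#!/bin/sh
# tesstrain_kor_cmd FONTS_DIR FONT
# Prints (does not run) a tesstrain.sh command for Korean line data.
# The relative directories are copied from the quoted reply and must be
# adapted to the local setup.
tesstrain_kor_cmd() {
  fonts_dir="$1"
  font="$2"
  echo "tesstrain.sh --lang kor --linedata_only --noextract_font_properties --langdata_dir ../langdata --tessdata_dir ../tessdata --fonts_dir $fonts_dir --fontlist \"$font\" --output_dir ../tesstutorial/kor"
}

tesstrain_kor_cmd /usr/share/fonts NanumGothic
```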


On Wednesday, February 28, 2018 at 6:18:41 PM UTC+9, shree wrote:
>
> Try with following - make sure that you change all variables with dir to 
> match your setup 
>
> tesstrain.sh \
>  --lang kor \
>  --noextract_font_properties \
>  --linedata_only \
>  --langdata_dir ../langdata \
>  --tessdata_dir ../tessdata \
>  --fonts_dir /mnt/c/Windows/Fonts \
>  --fontlist \
>   "Arial Unicode MS" \
>  --output_dir ../tesstutorial/kor
>
> The fontlist you specify in command will override the list in 
> language_specific.sh
>
>
>
>

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/5cf9466f-8ea0-46f5-b14e-df02ca2f3fe6%40googlegroups.com.


[tesseract-ocr] Re: I'm reading Using tesstrain (tesseract 4.0) wiki passage _ I have a question

2018-02-28 Thread 이경준
Sorry, but I have a problem with Korean.

The answer you gave works for English, but it does not work for Korean.

The logs show a font error. I looked at /training/language-specific.sh:

vi language-specific.sh 

Font list - kor _NeoLatin

So I installed the Korean fonts listed there

and rebooted,

but I get the same result. Is it possible to solve the Korean font issue?
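
The usage notes of tesstrain.sh suggest running text2image with --list_available_fonts and the appropriate --fonts_dir to see the font names Pango can find. A small sketch of checking such a listing for one font follows. The helper name and the sample listing are made up for illustration, and real output will look different.

```shell
#!/bin/sh
# font_listed NAME
# Reads a font listing on stdin and succeeds if NAME occurs in it
# (case-insensitive, fixed-string match).
font_listed() {
  grep -qiF -- "$1"
}

# Hypothetical sample listing, standing in for real text2image output.
sample_listing='Arial Unicode MS
NanumGothic'

if printf '%s\n' "$sample_listing" | font_listed "NanumGothic"; then
  echo "font is visible"
else
  echo "font is NOT visible"
fi

# Real use (needs the Tesseract training tools; the directory is an assumption):
#   text2image --list_available_fonts --fonts_dir /usr/share/fonts 2>&1 \
#     | font_listed "NanumGothic"
```

If the font does not show up in the real listing, the name passed to --fontlist will probably not be found either.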

On Wednesday, February 28, 2018 at 4:21:17 PM UTC+9, 이경준 wrote:
>
> Hi, I'm studying this passage, but I cannot understand what the flag 
> "--noextract_font_properties" means. So I looked at the file 
> /tesseract/training/tesstrain.sh  
>
> But I cannot find "--noextract_font_properties" there.
>
> Here is the usage:
>
> # USAGE:
> #
> # tesstrain.sh
> #    --fontlist FONTS           # A list of fontnames to train on.
> #    --fonts_dir FONTS_PATH     # Path to font files.
> #    --lang LANG_CODE           # ISO 639 code.
> #    --langdata_dir DATADIR     # Path to tesseract/training/langdata directory.
> #    --output_dir OUTPUTDIR     # Location of output traineddata file.
> #    --overwrite                # Safe to overwrite files in output_dir.
> #    --linedata_only            # Only generate training data for lstmtraining.
> #    --run_shape_clustering     # Run shape clustering (use for Indic langs).
> #    --exposures EXPOSURES      # A list of exposure levels to use (e.g. "-1 0 1").
> #
> # OPTIONAL flags for input data. If unspecified we will look for them in
> # the langdata_dir directory.
> #    --training_text TEXTFILE   # Text to render and use for training.
> #    --wordlist WORDFILE        # Word list for the language ordered by
> #                               # decreasing frequency.
> #
> # OPTIONAL flag to specify location of existing traineddata files, required
> # during feature extraction. If unspecified will use TESSDATA_PREFIX defined in
> # the current environment.
> #    --tessdata_dir TESSDATADIR # Path to tesseract/tessdata directory.
> #
> # NOTE:
> # The font names specified in --fontlist need to be recognizable by Pango using
> # fontconfig. An easy way to list the canonical names of all fonts available on
> # your system is to run text2image with --list_available_fonts and the
> # appropriate --fonts_dir path.
>
>
>
>
>
>
> Using tesstrain
>
> The setup for running tesstrain.sh 
> <https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-%E2%80%93-tesstrain.sh>
> is the same as for base Tesseract. Use the --linedata_only option for LSTM 
> training. Note that it is beneficial to have more training text and make 
> more pages though, as neural nets don't generalize as well and need to 
> train on something similar to what they will be running on. If the target 
> domain is severely limited, then all the dire warnings about needing a lot 
> of training data may not apply, but the network specification may need to 
> be changed.
>
> Training data is created using tesstrain.sh 
> <https://github.com/tesseract-ocr/tesseract/blob/master/training/tesstrain.sh>
> as follows. Note that your fonts location may vary.
>
> training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
>   --noextract_font_properties --langdata_dir ../langdata \
>   --tessdata_dir ./tessdata --output_dir ~/tesstutorial/engtrain
>
>
>
> Thank you very much. I would appreciate a reply from anyone.
>
>

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/48bce0e5-2467-4dbd-b5bb-fe47873a015e%40googlegroups.com.


[tesseract-ocr] I'm reading Using tesstrain (tesseract 4.0) wiki passage _ I have a question

2018-02-27 Thread 이경준
Hi, I'm studying this passage, but I cannot understand what the flag 
"--noextract_font_properties" means. So I looked at the file 
/tesseract/training/tesstrain.sh  

But I cannot find "--noextract_font_properties" there.
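
A flag that is missing from the usage comment may still be parsed in a helper script that tesstrain.sh sources (I believe the flag parsing lives in tesstrain_utils.sh, but treat that as an assumption). A sketch for searching every shell script in a directory for a flag name follows. The function name is made up for illustration.

```shell
#!/bin/sh
# find_flag DIR FLAG
#   DIR  : directory containing the training scripts
#   FLAG : flag name without the leading dashes
# Prints the names of the *.sh files in DIR that mention --FLAG.
find_flag() {
  grep -l -- "--$2" "$1"/*.sh
}

# Example, run from a Tesseract source checkout (the path is an assumption):
if [ -d ./training ]; then
  find_flag ./training noextract_font_properties
fi
```

Opening the reported file and searching for the flag should then show how it is handled.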

Here is the usage:

# USAGE:
#
# tesstrain.sh
#    --fontlist FONTS           # A list of fontnames to train on.
#    --fonts_dir FONTS_PATH     # Path to font files.
#    --lang LANG_CODE           # ISO 639 code.
#    --langdata_dir DATADIR     # Path to tesseract/training/langdata directory.
#    --output_dir OUTPUTDIR     # Location of output traineddata file.
#    --overwrite                # Safe to overwrite files in output_dir.
#    --linedata_only            # Only generate training data for lstmtraining.
#    --run_shape_clustering     # Run shape clustering (use for Indic langs).
#    --exposures EXPOSURES      # A list of exposure levels to use (e.g. "-1 0 1").
#
# OPTIONAL flags for input data. If unspecified we will look for them in
# the langdata_dir directory.
#    --training_text TEXTFILE   # Text to render and use for training.
#    --wordlist WORDFILE        # Word list for the language ordered by
#                               # decreasing frequency.
#
# OPTIONAL flag to specify location of existing traineddata files, required
# during feature extraction. If unspecified will use TESSDATA_PREFIX defined in
# the current environment.
#    --tessdata_dir TESSDATADIR # Path to tesseract/tessdata directory.
#
# NOTE:
# The font names specified in --fontlist need to be recognizable by Pango using
# fontconfig. An easy way to list the canonical names of all fonts available on
# your system is to run text2image with --list_available_fonts and the
# appropriate --fonts_dir path.






Using tesstrain

The setup for running tesstrain.sh is the same as for base Tesseract. Use 
the --linedata_only option for LSTM training. Note that it is beneficial to 
have more training text and make more pages though, as neural nets don't 
generalize as well and need to train on something similar to what they will 
be running on. If the target domain is severely limited, then all the dire 
warnings about needing a lot of training data may not apply, but the network 
specification may need to be changed.

Training data is created using tesstrain.sh as follows. Note that your fonts 
location may vary.

training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
  --noextract_font_properties --langdata_dir ../langdata \
  --tessdata_dir ./tessdata --output_dir ~/tesstutorial/engtrain



Thank you very much. I would appreciate a reply from anyone.

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/05a54fa0-b5c0-48eb-b7a1-7db0fe8dfe81%40googlegroups.com.


[tesseract-ocr] I have a Question about Creating Training Data

2018-02-27 Thread 이경준
Hi,
I'm Korean.
I'm studying Tesseract 4.0:
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
This page is very useful for studying Tesseract 4.0.

But I'm not good at reading English or at understanding Tesseract 4.0 training.
In short, I cannot understand the following sentences.

*Creating Training Data*

As with base Tesseract, there is a choice between rendering synthetic 
training data from fonts, or labelling some pre-existing images (like 
ancient manuscripts for example). In either case, the required format is 
still the tiff/box file pair, except that the boxes only need to cover a 
textline instead of individual characters. 'Newline' boxes with tab as the 
character must be inserted between textlines to indicate the end-of-line. 
Multi-word boxes require a different box format, as the space would confuse 
the parser
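
To make the passage a little more concrete, the sketch below prints box-file lines for one text line, assuming the usual box layout of character, left, bottom, right, top, page. Every character gets the bounding box of the whole text line, and a final entry with a tab as its character marks the end of the line. The coordinates, the helper name, and reusing the line box for the tab entry are assumptions made for illustration. A single word is used so that the multi-word case mentioned above does not arise.

```shell
#!/bin/sh
# line_boxes LEFT BOTTOM RIGHT TOP PAGE CHAR...
# Prints one box entry per character, all sharing the text line's box,
# followed by an end-of-line entry whose "character" is a tab.
line_boxes() {
  left="$1"; bottom="$2"; right="$3"; top="$4"; page="$5"
  shift 5
  for ch in "$@"; do
    printf '%s %s %s %s %s %s\n' "$ch" "$left" "$bottom" "$right" "$top" "$page"
  done
  # End-of-line marker: tab, then the same (placeholder) coordinates.
  printf '\t %s %s %s %s %s\n' "$left" "$bottom" "$right" "$top" "$page"
}

# Hypothetical text line "fox" with box (36,92)-(500,110) on page 0.
line_boxes 36 92 500 110 0 f o x
```

Passing the characters as separate arguments keeps the helper independent of how the shell splits multi-byte text such as Hangul.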

I do not understand this. Could you explain this passage to me? I would 
also like to see an example of a box/tiff file pair (for Tesseract 4.0).

Thank you.

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/a20fd0e3-b3ae-4ab2-9fa1-97b147fc86aa%40googlegroups.com.