Re: [tesseract-ocr] Can :traineddata" for Tesseract 3 be used for Tesseract 4

2018-06-14 Thread chandra churh chatterjee
How do I convert the images listed in my previous message into fonts, so
that I can run the tesstrain.sh command, which generates image files along
with box and .lstmf files?



Re: [tesseract-ocr] Can :traineddata" for Tesseract 3 be used for Tesseract 4

2018-06-13 Thread chandra churh chatterjee
Can you tell me which directory we have to run the command from, and what
the arguments should be if we are using our training data, which contains
the following files:
-07-2016 12:45 11 digits.f4.exp0.txt
-a   08-07-2016 12:37    198 digits.f5.exp0.box
-a   08-07-2016 12:10  14044 digits.f5.exp0.jpg
-a   08-07-2016 12:45  16309 digits.f5.exp0.tr
-a   08-07-2016 12:45     11 digits.f5.exp0.txt
-a   08-07-2016 12:31    188 digits.f6.exp0.box
-a   23-06-2016 13:06   9824 digits.f6.exp0.jpg
-a   08-07-2016 12:45  17538 digits.f6.exp0.tr
-a   08-07-2016 12:45     11 digits.f6.exp0.txt
-a   08-07-2016 12:38    199 digits.f7.exp0.box
-a   08-07-2016 12:11  13178 digits.f7.exp0.jpg
-a   08-07-2016 12:45  16019 digits.f7.exp0.tr
-a   08-07-2016 12:45     11 digits.f7.exp0.txt
-a   08-07-2016 12:38    198 digits.f8.exp0.box
-a   23-06-2016 13:06   9485 digits.f8.exp0.jpg
-a   08-07-2016 12:45  17078 digits.f8.exp0.tr
-a   08-07-2016 12:45     11 digits.f8.exp0.txt
-a   08-07-2016 12:38    199 digits.f9.exp0.box
-a   08-07-2016 12:11  13411 digits.f9.exp0.jpg
-a   08-07-2016 12:45  15916 digits.f9.exp0.tr
-a   08-07-2016 12:45     11 digits.f9.exp0.txt
-a   08-07-2016 12:57    543 digits.font_properties
-a   08-07-2016 12:59 184521 digits.inttemp
-a   08-07-2016 13:00   4832 digits.normproto
-a   08-07-2016 12:59     84 digits.pffmtable
-a   08-07-2016 12:59   6520 digits.shapetable
-a   08-07-2016 13:01 196755 digits.traineddata
-a   08-07-2016 12:59    658 digits.unicharset
-a   08-07-2016 12:55    648 unicharset

How do I convert these files, and from which directory should I run the
command you suggested?


Re: [tesseract-ocr] Can :traineddata" for Tesseract 3 be used for Tesseract 4

2018-06-13 Thread ShreeDevi Kumar
If you have box/tiff pairs in Tesseract 4 format, you can generate the
.lstmf files by running

tesseract lang.file.exp0.tif lang.file.exp0 lstm.train

lstm.train is a config file.
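
For a whole directory of pairs, the command above can be repeated in a loop.
What follows is only a sketch under a few assumptions: make_lstmf is a
hypothetical helper name (not part of Tesseract), every image is a .tif with
a .box file of the same base name next to it, and the box files are already
in Tesseract 4 format. By default it only prints the commands so they can be
checked first.

```shell
# make_lstmf DIR [run]
# Hypothetical helper: for every DIR/*.tif that has a matching .box file,
# print the "tesseract ... lstm.train" command (default, dry run) or
# execute it when the second argument is "run".
make_lstmf() {
    dir=$1
    mode=${2:-dry-run}
    for img in "$dir"/*.tif; do
        [ -e "$img" ] || continue          # the glob matched nothing
        base=${img%.tif}                   # strip the extension
        if [ ! -e "$base.box" ]; then
            echo "skipping $img: no matching .box file" >&2
            continue
        fi
        if [ "$mode" = run ]; then
            tesseract "$img" "$base" lstm.train
        else
            echo "tesseract $img $base lstm.train"
        fi
    done
}
```

Calling make_lstmf ./digits prints one command per pair, and
make_lstmf ./digits run executes them. The directory name ./digits is only
an example.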

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com



-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduWD0-BJ6sq4mypJhnc5FKudVcmSeBg%2BB5w5EARV4NPL4g%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


[tesseract-ocr] Can :traineddata" for Tesseract 3 be used for Tesseract 4

2018-06-13 Thread chandra churh chatterjee
I have trained Tesseract 3 with 64 fonts using the respective box and .tr
files. Now I want to use the same training data to train Tesseract 4,
after creating the starter traineddata as described under "Using tesstrain":

The setup for running tesstrain.sh is the same as for base Tesseract. Use 
--linedata_only option for LSTM training. Note that it is beneficial to 
have more training text and make more pages though, as neural nets don't 
generalize as well and need to train on something similar to what they will 
be running on. If the target domain is severely limited, then all the dire 
warnings about needing a lot of training data may not apply, but the 
network specification may need to be changed.

Training data is created using tesstrain.sh as follows. Note that your fonts
location may vary.

training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
  --noextract_font_properties --langdata_dir ../langdata \
  --tessdata_dir ./tessdata --output_dir ~/tesstutorial/engtrain

The above command makes LSTM training data equivalent to the data used to 
train base Tesseract for English. For making a general-purpose LSTM-based 
OCR engine, it is woefully inadequate, but makes a good tutorial demo.

Now try this to make eval data for the 'Impact' font:

training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
  --noextract_font_properties --langdata_dir ../langdata \
  --tessdata_dir ./tessdata \
  --fontlist "Impact Condensed" --output_dir ~/tesstutorial/engeval
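
The two tesstrain.sh invocations above differ only in --output_dir and the
optional --fontlist. A minimal sketch of that, assuming a hypothetical helper
named tesstrain_cmd that only prints the command line and uses no flags
beyond the ones quoted above:

```shell
# tesstrain_cmd OUTPUT_DIR [FONT]
# Hypothetical helper: prints (does not run) a tesstrain.sh command line.
# FONT, when given, is passed through --fontlist.
tesstrain_cmd() {
    out_dir=$1
    font=${2:-}
    cmd="training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only"
    cmd="$cmd --noextract_font_properties --langdata_dir ../langdata"
    cmd="$cmd --tessdata_dir ./tessdata"
    if [ -n "$font" ]; then
        cmd="$cmd --fontlist \"$font\""    # keep the quotes: font names contain spaces
    fi
    echo "$cmd --output_dir $out_dir"
}

# Single quotes keep "~" literal in the printed command.
tesstrain_cmd '~/tesstutorial/engtrain'
tesstrain_cmd '~/tesstutorial/engeval' 'Impact Condensed'
```

The printed lines can be reviewed and then pasted into a shell from the
directory where training/tesstrain.sh, ../langdata and ./tessdata resolve.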



Now I want to continue the training with my previous training data. The
problem is that this data consists of .tr files and box files, whereas
Tesseract 4 requires .lstmf files.
Is there any solution?
