Re: [tesseract-ocr] Re: Error when executing combine_lang_model script

2018-09-03 Thread Shree Devi Kumar
> Then I tried to create a starter traineddata file using combine_lang_model
> script. I used the below command for that,

When you run tesstrain.sh, it creates the starter traineddata for you by
calling the combine_lang_model script.

See below for messages from a small test run.

+ /home/ubuntu/tesseract/src/training/tesstrain.sh --fonts_dir ../.fonts
--lang sin --linedata_only --noextract_font_properties --langdata_dir
../langdata_lstm --tessdata_dir ../tessdata_best --fontlist FreeSerif
--training_text ../langdata_lstm/sin/sin.training_text --workspace_dir
/home/ubuntu/tmp/ --save_box_tiff --maxpages 1 --output_dir
../tesstutorial/sintest

=== Starting training for language 'sin'
[Tue Sep 4 03:21:08 UTC 2018]
/home/ubuntu/tesseract/src/training/text2image --fonts_dir=../.fonts
--font=FreeSerif --outputbase=/home/ubuntu/tmp//fc-cache/sample_text.txt
--text=/home/ubuntu/tmp//fc-cache/sample_text.txt
--fontconfig_tmpdir=/home/ubuntu/tmp//fc-cache
Rendered page 0 to file /home/ubuntu/tmp//fc-cache/sample_text.txt.tif

=== Phase I: Generating training images ===
Rendering using FreeSerif
[Tue Sep 4 03:21:10 UTC 2018]
/home/ubuntu/tesseract/src/training/text2image
--fontconfig_tmpdir=/home/ubuntu/tmp//fc-cache --fonts_dir=../.fonts
--strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0
--outputbase=/tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0 --max_pages=1
--font=FreeSerif --text=../langdata_lstm/sin/sin.training_text
Stripped 1 unrenderable words
Rendered page 0 to file /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.tif

=== Phase UP: Generating unicharset and unichar properties files ===
[Tue Sep 4 03:21:11 UTC 2018]
/home/ubuntu/tesseract/src/training/unicharset_extractor
--output_unicharset /tmp/sin-2018-09-04.Wa5/sin.unicharset --norm_mode 2
/tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.box
Extracting unicharset from box file
/tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.box
Wrote unicharset file /tmp/sin-2018-09-04.Wa5/sin.unicharset
[Tue Sep 4 03:21:11 UTC 2018]
/home/ubuntu/tesseract/src/training/set_unicharset_properties -U
/tmp/sin-2018-09-04.Wa5/sin.unicharset -O
/tmp/sin-2018-09-04.Wa5/sin.unicharset -X
/tmp/sin-2018-09-04.Wa5/sin.xheights --script_dir=../langdata_lstm
Loaded unicharset of size 111 from file
/tmp/sin-2018-09-04.Wa5/sin.unicharset
Setting unichar properties
Setting script properties
Warning: properties incomplete for index 7 = ි
Warning: properties incomplete for index 9 = ු
Warning: properties incomplete for index 17 = ්‌
Warning: properties incomplete for index 19 = ී
Warning: properties incomplete for index 38 = ්‍ර
Warning: properties incomplete for index 66 = ₹
Warning: properties incomplete for index 73 = ූ
Warning: properties incomplete for index 79 = ්‍ය
Warning: properties incomplete for index 89 = ක්‍
Writing unicharset to file /tmp/sin-2018-09-04.Wa5/sin.unicharset

=== Phase E: Generating lstmf files ===
Using TESSDATA_PREFIX=../tessdata_best
[Tue Sep 4 03:21:12 UTC 2018] /home/ubuntu/tesseract/src/api/tesseract
/tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.tif
/tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0 --psm 6 lstm.train
Tesseract Open Source OCR Engine v4.0.0-beta.4-93-ge4b9c with Leptonica
Page 1

=== Constructing LSTM training data ===
[Tue Sep 4 03:21:13 UTC 2018]
/home/ubuntu/tesseract/src/training/combine_lang_model --input_unicharset
/tmp/sin-2018-09-04.Wa5/sin.unicharset --script_dir ../langdata_lstm
--words ../langdata_lstm/sin/sin.wordlist --numbers
../langdata_lstm/sin/sin.numbers --puncs ../langdata_lstm/sin/sin.punc
--output_dir ../tesstutorial/sintest --lang sin --pass_through_recoder
Loaded unicharset of size 111 from file
/tmp/sin-2018-09-04.Wa5/sin.unicharset
Setting unichar properties
Setting script properties
Warning: properties incomplete for index 7 = ි
Warning: properties incomplete for index 9 = ු
Warning: properties incomplete for index 17 = ්‌
Warning: properties incomplete for index 19 = ී
Warning: properties incomplete for index 38 = ්‍ර
Warning: properties incomplete for index 66 = ₹
Warning: properties incomplete for index 73 = ූ
Warning: properties incomplete for index 79 = ්‍ය
Warning: properties incomplete for index 89 = ක්‍
Config file is optional, continuing...
Failed to read data from: ../langdata_lstm/sin/sin.config
Reducing Trie to SquishedDawg
Reducing Trie to SquishedDawg
Reducing Trie to SquishedDawg

=== Saving box/tiff pairs for training data ===
Moving /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.box to
../tesstutorial/sintest
Moving /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.tif to
../tesstutorial/sintest

=== Moving lstmf files for training data ===
Moving /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.lstmf to
../tesstutorial/sintest

Created starter traineddata for language 'sin'


Run lstmtraining to do the LSTM training for language 'sin'


real 0m5.238s
user 0m3.792s
sys 0m0.256s
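
For anyone who wants to repeat only the combine_lang_model step by hand, here
is a minimal shell sketch. It only prints the command line, using the flags
that appear in the log above. The function name starter_cmd and its argument
order are made up for illustration and are not part of Tesseract, and the
paths are assumed to contain no spaces.

```shell
#!/bin/sh
# Hypothetical helper (not part of Tesseract): print the combine_lang_model
# command line for one language, using only flags that appear in the log above.
#   $1 = language code        $2 = unicharset file
#   $3 = langdata directory   $4 = output directory
starter_cmd() {
    lang=$1
    unicharset=$2
    langdata=$3
    outdir=$4
    printf '%s' "combine_lang_model"
    printf ' %s' \
        --input_unicharset "$unicharset" \
        --script_dir "$langdata" \
        --words "$langdata/$lang/$lang.wordlist" \
        --numbers "$langdata/$lang/$lang.numbers" \
        --puncs "$langdata/$lang/$lang.punc" \
        --output_dir "$outdir" \
        --lang "$lang" \
        --pass_through_recoder
    printf '\n'
}

# Print (do not run) the command for the 'sin' test run shown above.
starter_cmd sin /tmp/sin-2018-09-04.Wa5/sin.unicharset \
    ../langdata_lstm ../tesstutorial/sintest
```

Once the printed line looks right for your own paths, it can be pasted into
the shell and run from the same directory that tesstrain.sh was run from.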


On Tue, Sep 4, 2018 at 2:49 AM, Shandigutt wrote:

> Adding more details to my query,
>
> *My tesseract version:*
> tesseract 4.0.0-beta.4-74-gd8237
>

[tesseract-ocr] Re: Error when executing combine_lang_model script

2018-09-03 Thread Shandigutt
Adding more details to my query,

*My tesseract version:*
tesseract 4.0.0-beta.4-74-gd8237
 leptonica-1.77.0
  libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11
 Found SSE

*My OS details:*
tharaka@tharaka-laptop-ubuntu:/tmp/sin-2018-09-01.E4T$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.1 LTS
Release: 18.04
Codename: bionic

Thanks

On Tuesday, September 4, 2018 at 12:11:50 AM UTC+3, Shandigutt wrote:
>
> Hi,
>
> I'm currently in the process of training Tesseract for a new language. I'm 
> following the Tesseract wiki training guidelines.
>
> Once I had built Tesseract from source and installed it, I first created my 
> own langdata set. 
>
> Then I created training data and eval data using the tesstrain.sh script.
>
> Then I tried to create a starter traineddata file using the 
> combine_lang_model script. I used the command below for that:
>
> *./build/src/training/combine_lang_model --input_unicharset 
> ../training/sintrain/sin/sin.unicharset --script_dir ../langdata --words 
> ../langdata/sin/sin.wordlist --puncs ../langdata/sin/sin.punc --numbers 
> ../langdata/sin/sin.numbers --output_dir ../training/combined_sin 
> --version_str 1.0 --lang sin*
>
> When executing the above command, I referred to the langdata I created on 
> my own for the word list, punctuation and numbers. I also referred to the 
> unicharset file that was created when creating the training data. But I got 
> the following error output:
>
> *Loaded unicharset of size 90 from file 
> ../training/sintrain/sin/sin.unicharset*
> *Setting unichar properties*
> *Setting script properties*
> *Warning: properties incomplete for index 4 = ී*
> *Warning: properties incomplete for index 6 = ි*
> *Warning: properties incomplete for index 11 = ු*
> *Warning: properties incomplete for index 15 = ්‌*
> *Warning: properties incomplete for index 30 = ූ*
> *Warning: properties incomplete for index 44 = ්‍ර*
> *Warning: properties incomplete for index 79 = ්‍ය*
> *Warning: properties incomplete for index 82 = ක්‍*
> *Warning: properties incomplete for index 89 = ර්‍*
> *Error writing unicharset!!*
>
> Can somebody assist me with this?
>
> Thanks
>
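
The thread does not say what caused "Error writing unicharset!!". One thing
that may be worth ruling out first, purely as an assumption, is that the
directory given to --output_dir does not exist yet or is not writable. Below
is a minimal sketch of such a check. The function name ensure_output_dir is
hypothetical and not part of Tesseract.

```shell
#!/bin/sh
# Hypothetical pre-flight check (not part of Tesseract): create the directory
# that will be passed as --output_dir if it is missing, then confirm that it
# is writable. Returns non-zero if either step fails.
ensure_output_dir() {
    dir=$1
    mkdir -p "$dir" || return 1
    [ -w "$dir" ]
}

# Example with the output directory from the command quoted above:
#   ensure_output_dir ../training/combined_sin || echo "cannot write there"
```

If the check passes and the error still appears, the cause lies elsewhere and
this sketch does not help to find it.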

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/71472620-135e-4777-8913-688e95fb9be3%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: [tesseract-ocr] Re: Tess4J: Invalid memory access

2018-09-03 Thread Subramaniyan Suresh
Thanks for your quick turnaround.

On Mon 3 Sep, 2018, 8:50 PM Quan Nguyen wrote:

> The issue has been fixed in the latest releases published today.
>
> Thanks.
>
> On Sunday, September 2, 2018 at 11:50:53 AM UTC-5, Subramaniyan Suresh
> wrote:
>>
>> I am using Tess4J in my project to extract text from an image (using the
>> Eclipse IDE). I am getting the following error when I try to run the OCR.
>> Any suggestions?
>>
>> *Error: Exception in thread "main" java.lang.Error: Invalid memory access*
>>
>>
>> *Note: I have attached the image file that I used.*
>>
>> *My Code*:
>>
>>
>> package tesseractTraining;
>>
>> import java.io.File;
>> import net.sourceforge.tess4j.*;
>>
>> public class TesseractMainRunner {
>>     public static void main(String[] args) {
>>         File imageFile = new File("E:\\Tesseract\\Test Images\\sample.png");
>>         Tesseract instance = new Tesseract();
>>         try {
>>             instance.setDatapath("C:\\Program Files (x86)\\Tesseract-OCR\\tessdata");
>>             instance.setLanguage("eng");
>>             String result = instance.doOCR(imageFile);
>>             System.out.println(result);
>>         } catch (TesseractException e) {
>>             System.err.println(e.getMessage());
>>         }
>>         imageFile.exists();
>>     }
>> }
>>



[tesseract-ocr] Re: Tess4J: Invalid memory access

2018-09-03 Thread Quan Nguyen
The issue has been fixed in the latest releases published today.

Thanks.

On Sunday, September 2, 2018 at 11:50:53 AM UTC-5, Subramaniyan Suresh 
wrote:
>
> I am using Tess4J in my project to extract text from an image (using the 
> Eclipse IDE). I am getting the following error when I try to run the OCR. 
> Any suggestions?
>
> *Error: Exception in thread "main" java.lang.Error: Invalid memory access*
>
>
> *Note: I have attached the image file that I used.*
>
> *My Code*:
>
>
> package tesseractTraining;
>
> import java.io.File;
> import net.sourceforge.tess4j.*;
>
> public class TesseractMainRunner {
>     public static void main(String[] args) {
>         File imageFile = new File("E:\\Tesseract\\Test Images\\sample.png");
>         Tesseract instance = new Tesseract();
>         try {
>             instance.setDatapath("C:\\Program Files (x86)\\Tesseract-OCR\\tessdata");
>             instance.setLanguage("eng");
>             String result = instance.doOCR(imageFile);
>             System.out.println(result);
>         } catch (TesseractException e) {
>             System.err.println(e.getMessage());
>         }
>         imageFile.exists();
>     }
> }
>
>
