date:20170924

Re: [tesseract-ocr] Does the number in the .exp# file type matter?

2017-09-24 Thread ShreeDevi Kumar

Please read tesstrain_utils.sh if you want to know the details.
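
For the .exp# question, here is my reading of the naming logic in tesstrain_utils.sh, reduced to a standalone sketch. The helper name exp_basename is made up for illustration, and the LANG.FONT.expN pattern (spaces in the font name turned into underscores, commas dropped) is an assumption that you should check against your copy of the script:

```shell
#!/usr/bin/env bash
# Sketch only: mimics how tesstrain_utils.sh appears to derive output names.
# Assumed pattern: LANG.FONTNAME.expEXPOSURE, with spaces in the font name
# replaced by "_" and commas removed. exp_basename is a hypothetical helper.
exp_basename() {
    local lang="$1" font="$2" exposure="$3"
    local fontname
    fontname=$(printf '%s' "$font" | tr ' ' '_' | sed 's/,//g')
    printf '%s.%s.exp%s\n' "$lang" "$fontname" "$exposure"
}

# Print the base names for one font at the three exposure levels.
for exposure in -1 0 1; do
    exp_basename eng "Times New Roman, Italic" "$exposure"
done
```

If that reading is right, the number after .exp is simply the exposure level, and tesstrain.sh generates the names itself, so there should be no need to name the tiff/box pairs by hand.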

The dictionary files are built from your sources in langdata. The unicharset
is also built from the training_text in langdata.
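
As a quick check before running tesstrain.sh, something like the following lists which source files are present for a language. The function name check_langdata is hypothetical, and the file layout (langdata/LANG/LANG.training_text, .wordlist, .punc, .numbers) is what I understand tesstrain_utils.sh to expect, not something guaranteed for every version:

```shell
#!/usr/bin/env bash
# Sketch only: report which langdata source files exist for a language.
# Assumed layout: LANGDATA/LANG/LANG.{training_text,wordlist,punc,numbers}.
# check_langdata is a hypothetical helper, not part of Tesseract.
check_langdata() {
    local langdata_dir="$1" lang="$2" ext path
    for ext in training_text wordlist punc numbers; do
        path="$langdata_dir/$lang/$lang.$ext"
        if [ -f "$path" ]; then
            echo "found   $path"
        else
            echo "missing $path"
        fi
    done
}

# Example call using the --langdata_dir value from the command in this thread.
check_langdata ../langdata eng
```

Under that assumption, adding your own words means editing langdata/LANG/LANG.wordlist before training rather than passing a ready-made DAWG to the script.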

On 24-Sep-2017 7:05 PM, "Dan9er" wrote:

> That answer doesn't help me.
>
> How can I add dictionary files to tesstrain?
>
> On Saturday, September 23, 2017 at 12:05:37 PM UTC-4, shree wrote:
>>
>> You cannot use a random unicharset; it needs to be the same one used for
>> training the model.
>>
>> For multiple exposures, use the following method:
>>
>> training/tesstrain.sh \
>>   --fonts_dir /mnt/c/Windows/Fonts \
>>   --lang eng \
>>   --noextract_font_properties --linedata_only \
>>   --exposures "-1, 0, 1" \
>>   --langdata_dir ../langdata \
>>   --tessdata_dir ../tessdata \
>>   --fontlist \
>>     "Arial" \
>>     "Tahoma" \
>>     "Times New Roman," \
>>     "Sanskrit 2003," \
>>     "FreeSerif Italic" \
>>     "Times New Roman, Italic" \
>>   --output_dir ../tesstutorial/eng
>>
>>
>> ShreeDevi
>> 
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Sat, Sep 23, 2017 at 8:46 PM, Dan9er wrote:
>>
>>> I'm making a unicharset file so that I can compile DAWG dictionary files
>>> to use with tesstrain.sh. I want to use multiple exposures (-1, 0, 1)
>>> for the tiff/box pairs. How should I name them to separate the
>>> different exposures?
>>>
>>> Can I do this?
>>>
>>> lang.Arial.exp0
>>> lang.Arial.exp1
>>> lang.Arial.exp2
>>>
>>> Or will changing the file numbers break things? As an alternative,
>>> can I do this?
>>>
>>> lang.Arial0.exp0
>>> lang.Arial1.exp0
>>> lang.Arial2.exp0
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to tesseract-oc...@googlegroups.com.
>>> To post to this group, send email to tesser...@googlegroups.com.
>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/6e9f4a45-5dde-41f6-8a41-a403778aef54%40googlegroups.com.
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>


[tesseract-ocr] Re: In Spanish language, character ‘o’ is recognized incorrectly as some round symbol

2017-09-24 Thread Quan Nguyen

It depends on your needs. There are also the fast traineddata files:

https://github.com/tesseract-ocr/tessdata_fast

It looks like many languages are covered there.
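
To compare the two sets on the same image, one option is to keep each repository in its own directory and select it with --tessdata-dir. The sketch below only prints the command lines and does not run them; the directory names ./tessdata_best and ./tessdata_fast and the helper ocr_cmd are assumptions for illustration:

```shell
#!/usr/bin/env bash
# Sketch only: build a tesseract command line that points at a chosen
# traineddata directory. ocr_cmd is a hypothetical helper, and the
# directory names are placeholders for wherever the files were downloaded.
ocr_cmd() {
    local tessdata_dir="$1" image="$2" outbase="$3" lang="$4"
    printf 'tesseract %s %s --tessdata-dir %s --oem 1 -l %s\n' \
        "$image" "$outbase" "$tessdata_dir" "$lang"
}

# Same image, two model sets, two separate output files.
ocr_cmd ./tessdata_best ImageDoc.png output_best spa
ocr_cmd ./tessdata_fast ImageDoc.png output_fast spa
```

Comparing output_best.txt and output_fast.txt on a few representative pages should show whether the faster models are accurate enough for your documents.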

On Saturday, September 23, 2017 at 12:38:46 PM UTC-5, Subrato Namata wrote:
>
> Thanks, Quan Nguyen. My initial results show that the issue is gone. Let me
> try with a few more samples.
> Additionally, is this the best trained data available for Tesseract for
> all the other languages too, and should we use only these files?
>
>
>
> On Saturday, 23 September 2017 00:02:51 UTC+5:30, Quan Nguyen wrote:
>>
>> Try the best traineddata:
>>
>> https://github.com/tesseract-ocr/tessdata_best
>>
>> On Friday, September 22, 2017 at 2:24:08 AM UTC-5, Subrato Namata wrote:
>>>
>>> Environment
>>>
>>> Windows Setup: tesseract-ocr-setup-4.0.0-alpha.20170804.exe
>>> Spanish Trained Data: 
>>> https://github.com/tesseract-ocr/tessdata/raw/4.00/spa.traineddata
>>> Command Used to OCR:
>>> tesseract.exe ImageDoc.png output --oem 1 -l spa
>>> where ImageDoc.png is a scanned Spanish document and
>>> output is the base name of the text file that receives the OCR result
>>>
>>> - Tesseract Version: 4.0
>>> - Platform: Windows version 64 Bit
>>>
>>> Current Behavior:
>>>
>>> In Spanish, the character ‘o’ is recognized incorrectly as a round
>>> symbol. The input file ImageDoc.png and a screenshot of the error are
>>> attached.
>>>
>>> [image: spanish]
>>> [image: imagedoc]
>>>
>>> Expected Behavior:
>>>
>>> Character ‘o’ should be recognized correctly.
>>>
>>


Re: [tesseract-ocr] Does the number in the .exp# file type matter?

2017-09-24 Thread Dan9er

That answer doesn't help me.

How can I add dictionary files to tesstrain?


