[tesseract-ocr] How to generate multiple tessedit_write_images output

2018-07-02 Thread Junye Li
Hi there, 

I want to see the actual input images processed by tesseract, so I ran the 
tesseract command with -c tessedit_write_images=TRUE. 

However, when I pass a multi-layer (multiple pages) .tiff image to tesseract, 
the output tessinput.tif image only contains one layer, which is the last 
page of the input image. 

Is there a way to generate output with multiple pages or multiple single 
output images?

Cheers
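
One possible workaround, sketched under assumptions: split the multi-page TIFF into single-page files, run tesseract once per page, and move tessinput.tif aside after each run. The sketch assumes that tessinput.tif is written to the current working directory and overwritten on every run (which would match the behaviour described above), and that the third-party Pillow library is installed. The names `split_pages`, `ocr_keep_inputs`, and the `page_0000.tif` naming scheme are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def tesseract_command(page_path, out_base):
    """Build the tesseract call for one single-page image."""
    return ["tesseract", str(page_path), str(out_base),
            "-c", "tessedit_write_images=TRUE"]


def split_pages(tiff_path, out_dir):
    """Write every page of a multi-page TIFF as its own file (needs Pillow)."""
    from PIL import Image, ImageSequence  # third-party: pip install Pillow

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    pages = []
    with Image.open(tiff_path) as img:
        for index, frame in enumerate(ImageSequence.Iterator(img)):
            page = out_dir / f"page_{index:04d}.tif"
            frame.save(page)
            pages.append(page)
    return pages


def ocr_keep_inputs(tiff_path, out_dir):
    """Run tesseract page by page and keep one tessinput image per page."""
    for page in split_pages(tiff_path, out_dir):
        base = page.with_suffix("")
        subprocess.run(tesseract_command(page, base), check=True)
        # Assumption: tessinput.tif lands in the current directory and is
        # overwritten by the next run, so move it aside right away.
        shutil.move("tessinput.tif", f"{base}.tessinput.tif")
```

Calling `ocr_keep_inputs("scan.tif", "pages")` would then leave one `page_NNNN.tessinput.tif` per input page next to the OCR output, provided the assumptions above hold for your tesseract version.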

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/b873ebd3-0630-451a-ae51-ec6647a07f37%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


[tesseract-ocr] Re: Tesseract v3.05.02 Training Error During Processing

2018-07-02 Thread Quan Nguyen
Wrong filename format. The box should be named `eng.dmd.exp0.box`.
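
The naming rule can be checked before training. This is a minimal sketch that assumes the box file must sit next to the image and share its base name, which is what the answer above implies; the helper names are hypothetical.

```python
from pathlib import Path


def expected_box_path(image_path):
    """Box file path implied by the image name: same base, .box suffix."""
    return Path(image_path).with_suffix(".box")


def box_name_matches(image_path, box_path):
    """True when the box file carries the name derived from the image."""
    return Path(box_path) == expected_box_path(image_path)


if __name__ == "__main__":
    print(expected_box_path("eng.dmd.exp0.tif"))
    print(box_name_matches("eng.dmd.exp0.tif", "eng.dmd.box"))
```

For the files in the question, the second check fails because `eng.dmd.box` is missing the `exp0` part.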

On Monday, July 2, 2018 at 7:40:26 AM UTC-5, James Lipham wrote:
>
> I have also updated the image to have everything as the same 
> font/size/etc, but still, tesseract just says "Error during processing." 
> with seemingly zero information as to why.
>
> Has anyone ever experienced this? If I can't find anything else out, I 
> guess I'll just have to step through the page processing code and add in a 
> bunch of printf statements just to see where tesseract is blowing up, which 
> seems a bit overkill.
>
> -- James
>
> On Sunday, July 1, 2018 at 3:13:27 PM UTC-5, James Lipham wrote:
>>
>> Good afternoon all!
>>
>> I'm running Tesseract v3.05.02 on OSX Sierra (installed via Homebrew), 
>> and I'm trying to train a custom dataset with some fairly small images that 
>> are programmatically generated from a dot matrix display.
>>
>> When running 
>> tesseract eng.dmd.exp0.tif eng.dmd.box nobatch box.train
>>
>> I get the following information:
>>
>> Tesseract Open Source OCR Engine v3.05.02 with Leptonica
>> Page 1
>> Detected 27 diacritics
>> Error during processing.
>>
>> There is no additional information output to the console, so I really 
>> don't know what my error could be. I've looked and verified that the tif 
>> image doesn't have an alpha channel, and the box file appears to be in the 
>> appropriate format.
>>
>> Has anyone run into this before? I'm thinking it's something absurdly 
>> simple. I've attached both the TIF and box files I'm using.
>>
>> Thank you very very much!
>>
>> -- James
>>
>



Re: [tesseract-ocr] Encoding of string failed when finetune for adding new fonts in fas language

2018-07-02 Thread Shree Devi Kumar
also see https://github.com/tesseract-ocr/tesseract/issues/549
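
To see which characters the "Failure bytes" dump in this thread refers to, the values can be turned back into text. The sketch below assumes that values such as `ffc2` are single UTF-8 bytes printed with a sign-extension prefix, so only the low 8 bits are kept; that reading is an assumption, not something the error message states. The function name is hypothetical.

```python
def decode_failure_bytes(dump):
    """Turn a 'Failure bytes' dump back into text.

    Assumption: values such as 'ffc2' are single bytes printed with a
    sign-extension prefix, so only the low 8 bits of each value are kept.
    """
    raw = bytes(int(token, 16) & 0xFF for token in dump.split())
    # errors="replace" keeps going if the dump was cut in mid-character.
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # First few values from the error message quoted in this thread.
    print(decode_failure_bytes("ffc2 ffa9 20 ffd8 ffa8 ffd8 ffa7"))
```

Under that assumption the dump starts with the copyright symbol, which fits the advice to check whether that symbol is in the unicharset.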




[tesseract-ocr] A friendly suggestion for the "tesseract-ocr" group members (Concern to all members)

2018-07-02 Thread cohengil333
It seems that across all languages and revisions, people (including me) tend 
to search a lot for answers here in the group.
So I have a suggestion:
Could the group administrator pin a message with a spreadsheet that lists 
the training status of each revision for each language? That way the 
information would be nicely organized in a single table, and people could 
update it from time to time.

for example:


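*TrainingTesseract status*

                         Revisions
Language   | 3.00 | 3.01 | 3.02 | 3.03 | 3.04   | 3.05 | 4.0
-----------+------+------+------+------+--------+------+--------
English    |      |      |      |      | worked |      | worked
Hebrew     |      |      |      |      |        |      |
Hindi      |      |      |      |      |        |      |
Arabic     |      |      |      |      |        |      |
German     |      |      |      |      |        |      |
Chinese    |      |      |      |      |        |      |
Russian    |      |      |      |      |        |      |
Vietnamese |      |      |      |      |        |      |
Polish     |      |      |      |      |        |      |
...        |      |      |      |      |        |      |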



*Please let me know if this is a hassle; if so, I'll do my best to assist 
with this chore.*

Thank you all,
*Gil*



[tesseract-ocr] Re: Where can i get Other language Cube language files.

2018-07-02 Thread cohengil333
Great question. I'm stuck on this too, but with Hebrew OCR.

Any suggestion?


On Tuesday, March 13, 2018 at 7:13:50 PM UTC+2, Harshit Dohare wrote:
>
> Hi,
>
> As far as I have looked into Tesseract, cube files are only available for 
> Hindi and Arabic language. 
> Check here - https://github.com/tesseract-ocr/tesseract/wiki/Data-Files
>
> Do you know how to work with these .cube files ? Because I am trying to 
> use cube files for recognition of Hindi language on android but I am unable 
> to figure out.
>
> Thanks,
> -Harshit 
>



Re: [tesseract-ocr] Encoding of string failed when finetune for adding new fonts in fas language

2018-07-02 Thread Shree Devi Kumar
You can use find_fonts with your training_text to locate the fonts to use.

Modify the following command to match your directory setup and try

echo "## FIND FONTS ##"
# Find fonts which can render your training_text. Run `fc-cache -vf` to refresh cache.
# You can change the minimum coverage % as needed.
# This process can take a while if you have a number of installed fonts.
# Review the generated fontlist and modify, if needed.
# 2000 fonts found. Use a smaller set

nice text2image --find_fonts \
--fonts_dir $fonts_dir \
--text $langdata_dir/$Lang/$Lang.training_text \
--min_coverage 0.999  \
--render_per_font=false \
--outputbase $langdata_dir/$Lang/$Lang \
|& grep raw \
 | sed -e 's/ :.*/@ \\/g' \
 | sed -e "s/^/ '/" \
 | sed -e "s/@/'/g" > $langdata_dir/$Lang/$Lang.fontslist.txt
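
Besides font coverage, the other checks suggested earlier in this thread (a mismatch between unicharset and training text, and tabs or other unprintable characters) can be scripted. The sketch below assumes a unicharset file whose first line is the entry count and whose remaining lines each start with the unichar followed by a space, and it assumes `NULL`, `Joined` and `|Broken|0|1` are special entries rather than real characters. All function names are hypothetical.

```python
import unicodedata

# Assumption: these entries are placeholders, not characters to match.
SPECIAL_ENTRIES = {"NULL", "Joined", "|Broken|0|1"}


def load_unichars(unicharset_path):
    """Collect the first field of every entry in a unicharset file."""
    with open(unicharset_path, encoding="utf-8") as handle:
        lines = handle.read().split("\n")[1:]  # skip the entry count
    entries = {line.split(" ", 1)[0] for line in lines if line}
    return entries - SPECIAL_ENTRIES


def uncovered_characters(text, unichars):
    """Return the characters of `text` that no unicharset entry contains."""
    known = set("".join(unichars))
    return sorted({ch for ch in text if not ch.isspace() and ch not in known})


def control_characters(text):
    """Return (line number, code point) for tabs and other control characters."""
    hits = []
    for lineno, line in enumerate(text.split("\n"), start=1):
        for ch in line:
            if unicodedata.category(ch) == "Cc":
                hits.append((lineno, f"U+{ord(ch):04X}"))
    return hits
```

A typical use would be to read the training text, call `uncovered_characters(text, load_unichars(path))`, and add or remove the reported characters before fine-tuning. Only the control category is flagged, so the zero-width non-joiner that is common in Persian text is not reported.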

On Mon, Jul 2, 2018 at 12:06 PM ran go  wrote:

> in my opinion error is for font-type, for some font there is no error but
> for some other fonts there is error
>
> On Mon, Jul 2, 2018 at 9:15 AM, john  wrote:
>
>> I use tesseract 4.0.0-beta.1. downloaded from this link (UB mannheim)
>> 
>>
>> On Saturday, June 30, 2018 at 7:13:30 PM UTC+4:30, shree wrote:
>>>
>>> Also check that there is no tab or other unprintable character in your
>>> training text.
>>>
>>> Which version of tesseract are you using? show output  of
>>>
>>> tesseract -v
>>>
>>>
>>> On Sat, Jun 30, 2018 at 8:04 PM Shree Devi Kumar 
>>> wrote:
>>>
>>>> Then there must be a mismatch between the unicharset you are using and
>>>> the training text, e.g. check whether the copyright symbol is in your
>>>> unicharset.
>>>>
>>>> On Sat, Jun 30, 2018 at 4:48 PM john  wrote:
>>>>
>>>>> I saw that link. This error occurred many times; how can I prevent that?
>>>>>
>>>>> On Saturday, June 30, 2018 at 3:17:26 PM UTC+4:30, shree wrote:
>>>>>>
>>>>>> see
>>>>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#error-messages-from-training
>>>>>>
>>>>>> On Sat, Jun 30, 2018 at 3:23 PM john  wrote:
>>>>>>
>>>>>>> Encoding of string failed! Failure bytes: ffc2 ffa9 20
>>>>>>> ffd8 ffa8 ffd8 ffa7 ffd8 ffae ffd8 ffaa
>>>>>>> ffd9 ff86 ffd8 ffa7 20 ffd9 ff84 ffd8 ffa7
>>>>>>> ffd8 ffa4 ffd8 ffb3 20 ffdb ff8c ffd9 ff86
>>>>>>> ffd8 ffa7 ffd8 ffb1 ffdb ff8c ffd8 ffa7 20
>>>>>>> ffd8 ffa7 ffd8 ffa8 20 ffd8 ffaa ffd8 ffa8
>>>>>>> ffd8 ffab ffd9 ff87 20 ffd8 ffaf ffd8 ffa7
>>>>>>> ffd9 ff81 ffd8 ffaa ffd8 ffb3 ffd8 ffa7 20
>>>>>>> ffd9 ff86 ffdb ff8c ffd9 ff86 ffda ff86
>>>>>>> ffd9 ff85 ffd9 ff87 20 ffd9 ff82 ffd9 ff84
>>>>>>> ffd8 ffb7 ffd9 ff85
>>>>>>> Can't encode transcription: '۱۹ 2006© باختنا لاؤس یناریا اب تبثه
>>>>>>> دافتسا نینچمه قلطم' in language ''
>>>>>>> ^C
>>>>>>>
>>>>>>> When I fine-tune the network for the fas language I see the error above.
>>>>>>> What is wrong with the training?
>>>>>>>
>>>>>> --
>>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>>>

[tesseract-ocr] Re: Tesseract v3.05.02 Training Error During Processing

2018-07-02 Thread James Lipham
I have also updated the image to have everything as the same font/size/etc, 
but still, tesseract just says "Error during processing." with seemingly 
zero information as to why.

Has anyone ever experienced this? If I can't find anything else out, I 
guess I'll just have to step through the page processing code and add in a 
bunch of printf statements just to see where tesseract is blowing up, which 
seems a bit overkill.

-- James

On Sunday, July 1, 2018 at 3:13:27 PM UTC-5, James Lipham wrote:
>
> Good afternoon all!
>
> I'm running Tesseract v3.05.02 on OSX Sierra (installed via Homebrew), and 
> I'm trying to train a custom dataset with some fairly small images that are 
> programmatically generated from a dot matrix display.
>
> When running 
> tesseract eng.dmd.exp0.tif eng.dmd.box nobatch box.train
>
> I get the following information:
>
> Tesseract Open Source OCR Engine v3.05.02 with Leptonica
> Page 1
> Detected 27 diacritics
> Error during processing.
>
> There is no additional information output to the console, so I really 
> don't know what my error could be. I've looked and verified that the tif 
> image doesn't have an alpha channel, and the box file appears to be in the 
> appropriate format.
>
> Has anyone run into this before? I'm thinking it's something absurdly 
> simple. I've attached both the TIF and box files I'm using.
>
> Thank you very very much!
>
> -- James
>



Re: [tesseract-ocr] Train 2 language together

2018-07-02 Thread Zohreh Khosrobeygi
Thx. you're right.

On Sunday, July 1, 2018 at 10:02:55 PM UTC+4:30, shree wrote:
>
> The font being used does not support English.
>
> On Sun, Jul 1, 2018 at 10:06 PM Zohreh Khosrobeygi wrote:
>
>> Hi,
>> I have been training the text:
>>
>> 272-135031- BECAUSE YOU WERE SLEEPING INSTEAD OWHILE POOR SHAGGY 
>> SITS THERE A COOING DOVE
>> فیلم و و , منابع سال آگهی آخرين آخرین بود. ساخت و کنی
>>
>> The text contains both Persian and English, but when the TIFF file is 
>> created, all of the English text is removed. The TIFF file contains 
>> this:
>>
>> 272-135031-
>> فیلم و و , منابع سال آگهی آخرين آخرین بود. ساخت و کنی
>>
>> But for Persian we need to train both languages together.
>> How can I solve the problem? How can I train two languages together?
>> Thanks a lot.
>>
>
>
> -- 
>
> 
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>



Re: [tesseract-ocr] Fine tuning existing model

2018-07-02 Thread Lorenzo Bolzani
Hi Shree,
I replaced the line:

 merge_unicharsets $(TESSDATA)/$(CONTINUE_FROM).lstm-unicharset $(TRAIN)/my.unicharset  "$@"

with:

 cp "$(TRAIN)/my.unicharset" "data/unicharset"

(I write this in case someone else is following this thread).

And now I have a fine tuned brand new model with only the characters I
need. Nice :)

For the training I'm using actual crops from the documents I need to ocr,
painfully hand labeled.

I'm still trying to figure out the right number of iterations. I've seen
that there is a train/eval split; I've set it to 80/20.

I did 300/600/1000/5000/7500/1 iteration and checked the model with:

lstmeval --model export/$1.traineddata --eval_listfile data/list.eval 2>&1 | grep iteration

and I see that the eval error keeps going down, with a big error drop from
1.17 to 0.5 passing from 7500 to 1. My characters are very noisy and
irregular and my lines are very short, 1 to 4 words at most. Maybe this is
the reason why I need more iterations.
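
The check above can be repeated for several exported models and compared in one go. This is only a sketch: it reuses the two lstmeval options shown in the command above, and it assumes the report line contains the text `Eval Char error rate=` followed by a number, so the pattern may need adjusting for your build. The function names and the `models` mapping are hypothetical.

```python
import re
import subprocess

# Assumed shape of the lstmeval report line; adjust if yours differs.
CHAR_ERROR = re.compile(r"Eval Char error rate=([0-9.]+)")


def lstmeval_command(model, eval_listfile):
    """Build the lstmeval call for one exported model."""
    return ["lstmeval", "--model", str(model),
            "--eval_listfile", str(eval_listfile)]


def parse_char_error(output):
    """Extract the character error rate, or None if no report line is found."""
    match = CHAR_ERROR.search(output)
    return float(match.group(1)) if match else None


def evaluate(models, eval_listfile):
    """Map each label in `models` (label -> traineddata path) to its error."""
    results = {}
    for label, model in models.items():
        done = subprocess.run(lstmeval_command(model, eval_listfile),
                              stdout=subprocess.PIPE,
                              stderr=subprocess.STDOUT,  # same as 2>&1
                              text=True, check=True)
        results[label] = parse_char_error(done.stdout)
    return results
```

Plotting or simply printing the returned mapping makes it easier to see where the eval error stops improving, which is the point at which more iterations are unlikely to help.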

I'm fine-tuning from Italian, the language of my documents. I'll try eng
too to see if it works better. Now that the pipeline is in place it's easy
to try different options.


Thank you for your help so far.


Bye

Lorenzo


2018-06-30 6:18 GMT+02:00 Shree Devi Kumar :

> > The problem was a "-gt.txt" rather than a ".gt.txt" as in my train files.
> > Now I can run your script directly.
>
> Oh, I remember now. I had changed that for ease in renaming files for some
> reason.
>
> > In this way can I train a model that, for example, only recognize
> > uppercase characters, or numbers, simply by providing only uppercase
> > training data? Or is there something else to configure?
>
> You could try finetune from English. Remove the line regarding merge of
> unicharsets from my makefile (use command from original script). 300
> iterations should be enough as you are not adding any characters. Try to
> have a training text which resembles the kind of words that you expect to
> OCR.
>


Re: [tesseract-ocr] recognising roman with sanskrit diacritics

2018-07-02 Thread yajva
Many thanks. Downloaded and using.
Will wait for next ver.


On Sunday, July 1, 2018 at 12:21:19 AM UTC+5:30, shree wrote:
>
> I have uploaded a new version of traineddata file at 
>
> https://github.com/Shreeshrii/tessdata_shreetest/blob/master/iast-layer-18003.traineddata
>
> Attached is the OCRed output for pages 13-24 of dark pdf with it.
>
> I am still training a different variation.
>
>
>
> On Wed, Jun 27, 2018 at 6:46 PM Shree Devi Kumar wrote:
>
>> ok. I will take a look.
>>
>> On Wed, Jun 27, 2018 at 5:04 PM yajva wrote:
>>
>>> Checked with both light & dark pdfs. The results are very good. Thanks.
>>>
>>> A few concerns. E is consistently missed in both. J is missed 
>>> consistently in darker image but recognized as T in dark image. ṝ is 
>>> recognized as ṛ consistently. Can these be addressed ?
>>> I am using tesseract 4 alpha windows build from command line.
>>>
>>> Are the dev files in repos ?
>>>
>>>
>>> On Tuesday, June 26, 2018 at 11:06:06 PM UTC+5:30, shree wrote:

>>>> I had used ghostview to convert PDF to tif or png.
>>>>
>>>> You can ocr PDF directly with gimagereader using the traineddata file I
>>>> sent.
>>>>
>>>> See links for new windows binaries in msg below.
>>>>
>>>>
>>>> At last, here are some fresh builds:
>>>>
>>>> https://smani.fedorapeople.org/tmp/gImageReader_3.2.99_qt5_i686_tesseract4.git87635c1.exe
>>>>
>>>> https://smani.fedorapeople.org/tmp/gImageReader_3.2.99_qt5_x86_64_tesseract4.git87635c1.exe
>>>>
>>>> I'd be also interested in testing of the tessdata manager, which should
>>>> now also properly handle script tessdatas
>>>>
>>>> On Tue 26 Jun, 2018, 10:59 PM yajva,  wrote:
>>>>
>>>>> The doc is diff ver of the same text. Here's the doc used for the
>>>>> first. png. This is slightly darker, but the one sent earlier is
>>>>> cleaner. Let me know which is more amenable for OCRing. I use PDF
>>>>> Shaper to extract images and convert to png using xnview.
>>>>>
>>>>> On Tuesday, June 26, 2018 at 7:48:28 PM UTC+5:30, shree wrote:
>>>>>>
>>>>>> Traineddata file is attached for use with tesseract4.0.0-beta.
>>>>>>
>>>>>> How did you create the test png from the pdf? I am not getting as
>>>>>> good quality, tried various settings with irfanview.
>>>>>>
>>>>>> On Tue, Jun 26, 2018 at 4:58 PM yajva  wrote:
>>>>>>
>>>>>>> Sorry for the delay, my system was down.
>>>>>>>
>>>>>>> I am getting "Page not Found" for the link given. Can you pl
>>>>>>> re-check?
>>>>>>>
>>>>>>> Here's the doc I am trying to OCR
>>>>>>>
>>>>>>> On Saturday, June 23, 2018 at 9:46:08 PM UTC+5:30, shree wrote:
>>>>>>>>
>>>>>>>> Please test with traineddata file from
>>>>>>>> https://github.com/Shreeshrii/tessdata_sanskrit/tree/master/iast-plus1
>>>>>>>>
>>>>>>>> Need to check that is it not overfitted.
>>>>>>>>
>>>>>>>> Please share a couple more images which I can use for testing.
>>>>>>>>
>>>>>>>> On Thu, Jun 21, 2018 at 11:38 PM yajva  wrote:
>>>>>>>>
>>>>>>>>> one more correction.
>>>>>>>>>
>>>>>>>>> On Thursday, June 21, 2018 at 11:34:00 PM UTC+5:30, yajva wrote:
>>>>>>>>>>
>>>>>>>>>> done
>>>>>>>>>>
>>>>>>>>>> On Wednesday, June 20, 2018 at 9:05:01 PM UTC+5:30, shree wrote:
>>>>>>>>>>>
>>>>>>>>>>> I am attaching the OCRed text. Please correct it so that  I can
>>>>>>>>>>> use as groundtruth for further training and testing.
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jun 20, 2018 at 3:15 PM Shree Devi Kumar <
>>>>>>>>>>> shree...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I had done a training for sanskrit for both devanagari and IAST
>>>>>>>>>>>> but it does not include cedilla for Sh
>>>>>>>>>>>>
>>>>>>>>>>>> I will add it and let you know.
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed 20 Jun, 2018, 1:17 AM yajva,  wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I have tried Google OCR for recognizing Sanskrit text in Roman
>>>>>>>>>>>>> with diacritics (IAST). It recognizes above macron but not dots
>>>>>>>>>>>>> below also joining grave and accent. Is there any traineddata
>>>>>>>>>>>>> available for tesseract that can do this with good accuracy ?
>>>>>>>>>>>>> Attached a sample page that I am interested in.
>>>>>>>>>>>>>

Re: [tesseract-ocr] Encoding of string failed when finetune fot adding new fonts is fas language

2018-07-02 Thread ran go
in my opinion error is for font-type, for some font there is no error but
for some other fonts there is error
