Re: [tesseract-ocr] Re: Tesseract v3.03 and norwegian language

2023-09-15 Thread Des Bw
I have exactly the same problem with Amharic. Three characters are 
missing, and that is ruining the OCR results. 
Dear Shree, can you help me, please?


-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/be8e5df8-1283-4aa1-9b92-b3a4afc585f3n%40googlegroups.com.


Re: [tesseract-ocr] Re: Tesseract v3.03 and norwegian language

2017-01-09 Thread Ludvig F Aarstad
I think I might stick with the postprocessing for now; there are too many 
oddities I need to learn before I can compile it ;). Still, I think this project 
is awesome, and I might take it up a notch and try the same thing I am doing now, 
just using .NET code :)



Re: [tesseract-ocr] Re: Tesseract v3.03 and norwegian language

2017-01-09 Thread ShreeDevi Kumar
Actually, postprocessing with a replace for Æ will be your best bet, as 4.0 is
slower than the older Tesseract engine for Latin-based scripts.
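
A minimal sketch of such a replace, assuming the OCR output has "AE" wherever
the page has "Æ" (an assumption; adjust the pattern to whatever your engine
actually emits). The helper name fix_ae is made up for illustration.

```shell
#!/bin/sh
# Assumption: the OCR engine emits the two letters "AE" where the page has "Æ".
# Only "AE" directly followed by a lowercase ASCII letter is rewritten, so
# all-caps words that legitimately contain "AE" are left alone.
fix_ae() {
  sed 's/AE\([a-z]\)/Æ\1/g'
}

# Example: clean one line of OCR output read from standard input.
printf '%s\n' 'AErlighet varer lengst.' | fix_ae
```

The filter reads standard input, so it can sit at the end of any pipeline that
produces the OCR text. A blanket replace of every "AE" would be simpler but
could damage words written entirely in capitals.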

You can experiment with 4.0.0alpha.

See https://github.com/tesseract-ocr/tesseract/wiki/Compiling for build
instructions. You will also need to compile the latest version of Leptonica
before that.

Sources are at:
https://github.com/DanBloomberg/leptonica.git
https://github.com/tesseract-ocr/tesseract.git

There is no separate src directory for tesseract.

I used git clone to get the master branch and then use git pull origin to
update it. You can also download a zip of the current master.
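
The clone and update steps can be sketched as below. This is a dry run by
default: it only prints the git commands unless DRY_RUN=0 is set. The helper
names (run, fetch_sources, update_sources) and the directory layout are
assumptions for illustration, not part of either project.

```shell
#!/bin/sh
# Print each command; execute it only when DRY_RUN=0 (the default is a dry run).
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = 0 ]; then "$@"; fi
}

# First time: clone both repositories into the current directory.
fetch_sources() {
  run git clone https://github.com/DanBloomberg/leptonica.git
  run git clone https://github.com/tesseract-ocr/tesseract.git
}

# Later: update the existing checkouts from their origin remote.
update_sources() {
  run git -C leptonica pull origin
  run git -C tesseract pull origin
}

fetch_sources
```

Build Leptonica first, then Tesseract, following the Compiling wiki page.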



ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


Re: [tesseract-ocr] Re: Tesseract v3.03 and norwegian language

2017-01-08 Thread Ludvig F Aarstad
No worries, I will play around and see what I can get working. For now I am 
using a simple replace in my script to handle the Æ.
How would I go about compiling Tesseract 4.0 alpha using git and 
CMake? The wiki says the 4.0 alpha source code is available in the master 
branch of the repository, but I have yet to find it... The compiling part 
seems straightforward enough, but I need the source ;).

I tried installing gImageReader, hoping that it would give me the DLL for 
Tesseract 4.0, but no. 


Re: [tesseract-ocr] Re: Tesseract v3.03 and norwegian language

2017-01-08 Thread ShreeDevi Kumar
Sorry, I am not familiar with PowerShell and NuGet.

If you are on Windows, you can try the experimental 4.0.0alpha binaries
of gImageReader, a GUI front end to Tesseract. You can OCR a PDF
directly or load multiple images at the same time.

- excuse the brevity, sent from mobile



Re: [tesseract-ocr] Re: Tesseract v3.03 and norwegian language

2017-01-08 Thread Ludvig F Aarstad
Thanks Shree :D. Really appreciate it. Will this work with v3.03 too? I am 
basing my code on this: https://github.com/jourdant/powershell-paperless 
and there is a script to initialize the environment that is getting the 
tesseract files from here: https://nuget.org/api/v2/package/tesseract-ocr. 
Would you be able to point me in the right direction on how to move this 
from 3.03 to the 4.0alpha?





Re: [tesseract-ocr] Re: Tesseract v3.03 and norwegian language

2017-01-06 Thread ShreeDevi Kumar
I have uploaded modified nor.traineddata at

https://github.com/Shreeshrii/tessdata4alpha/blob/master/nor.traineddata

See the attached log and info file for the commands used in training. Training
took about 9 hours on my PC and reached only about 1700 iterations before the
PC froze, so I rebooted and created the traineddata from the checkpoint
norlayer0.853_1615.lstm, i.e. 0.853% character error rate at iteration 1615.
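
To make the checkpoint naming concrete, here is a small sketch that splits a
name such as norlayer0.853_1615.lstm into its two numbers. It assumes the name
is the model output prefix followed by the error rate, an underscore and the
iteration number, as described above; parse_checkpoint is a made-up helper.

```shell
#!/bin/sh
# Split a checkpoint name of the form <prefix><error>_<iteration>.lstm.
# $1 is the checkpoint file name, $2 is the model prefix (here "norlayer").
parse_checkpoint() {
  name=${1##*/}        # drop any directory part
  name=${name%.lstm}   # drop the extension
  name=${name#"$2"}    # drop the model prefix
  echo "error=${name%_*}% iteration=${name##*_}"
}

parse_checkpoint norlayer0.853_1615.lstm norlayer
```

This is handy for picking the checkpoint with the lowest error rate out of a
directory of them.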


ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Fri, Jan 6, 2017 at 5:59 PM, ShreeDevi Kumar wrote:

> @Peter, have you tried the 4.0.0alpha version yet?
>
> @Ludvig F. Aarstad - 'Add a layer' training worked for adding 'Æ' - I will
> upload the new traineddata so that you can test. You will need the
> 4.0.0alpha version for testing.
>
> Here are a couple of the training TIFs and the OCRed text.
>
> ShreeDevi
-
// Error rate at which to transition to stage 1.
const double kStageTransitionThreshold = 10.0;

// Appends iteration learning_iteration()/training_iteration()/
// sample_iteration() to the log_msg.

// Delta error is the fraction of timesteps with >0.5 error in the top choice
// score. If zero, then the top choice characters are guaranteed correct,
// even when there is residue in the RMS error.

// Skip ratio measures the difference between sample_iteration_ and
// training_iteration_, which reflects the number of unusable samples,
// usually due to unencodable truth text, or the text not fitting in the
// space for the output.

---
$ mkdir -p ~/tesstutorial/nor_layer
$ combine_tessdata -e ../tessdata/nor.traineddata \
>   ~/tesstutorial/nor_layer/nor.lstm
Extracting tessdata components from ../tessdata/nor.traineddata
Wrote /home/shree/tesstutorial/nor_layer/nor.lstm
$  lstmtraining -U ~/tesstutorial/nor/nor.unicharset \
>   --script_dir ../langdata  --debug_interval 0 \
>   --continue_from ~/tesstutorial/nor_layer/nor.lstm \
>   --append_index 5 --net_spec '[Lfx256 O1c105]' \
>   --model_output ~/tesstutorial/nor_layer/norlayer \
>   --train_listfile ~/tesstutorial/nor/nor.training_files.txt \
>   --max_iterations 5
Loaded file /home/shree/tesstutorial/nor_layer/nor.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Continuing from /home/shree/tesstutorial/nor_layer/nor.lstm
Other case É of é is not in unicharset
Other case Ö of ö is not in unicharset
Other case Ä of ä is not in unicharset
Appending a new network to an old one!!Setting unichar properties
Setting properties for script Common
Setting properties for script Latin
Warning: given outputs 105 not equal to unicharset of 101.
Num outputs,weights in serial:
  Lfx256:256, 394240
  Fc101:101, 25957
Total weights = 420197
Built network:[1,0,0,1[C5,5Ft16]Mp3,3Lfys64Lfx128Lrx128Lfx256Fc101] from 

Re: [tesseract-ocr] Re: Tesseract v3.03 and norwegian language

2017-01-06 Thread Peter


On Thursday, 5 January 2017 at 04:39:01 UTC+1, shree wrote:
>
> Ray is planning to retrain the languages for the new 4.0.0 version 
> sometime in January. So it would be helpful if you could open an issue on 
> https://github.com/tesseract-ocr/langdata/issues with this information.
>

Is it possible to contribute training data for this effort? I realise 
Swedish will not be at the top of the list, but I think it would be easy to 
involve some of the research community here in contributing training data 
if it could improve the language model.

/Peter 



Re: [tesseract-ocr] Re: Tesseract v3.03 and norwegian language

2017-01-05 Thread ShreeDevi Kumar
I tried 'Finetune', but that does not help with adding a character.

Trying 'Add a layer' now.

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com



Re: [tesseract-ocr] Re: Tesseract v3.03 and norwegian language

2017-01-05 Thread Ludvig F Aarstad
Fantastic, thanks:).



Re: [tesseract-ocr] Re: Tesseract v3.03 and norwegian language

2017-01-05 Thread ShreeDevi Kumar
I will give it a try and let you know.



Re: [tesseract-ocr] Re: Tesseract v3.03 and norwegian language

2017-01-05 Thread Ludvig F Aarstad
I can come up with several samples, if that helps.
I also realized that the occurrence of Æ in the beginning of a sentence is 
quite rare. It will in most cases only be for names of people (surnames 
mostly) and names of places and streets in addition to some specific 
Norwegian words that can occur in the beginning of a sentence and thus 
require the capital Æ.

Some samples (English translations added for reference):
Ærfuglveien 44 er adressen jeg bor på - Ærfuglveien 44 is the address where 
I live.
Min adresse er Ærfuglgaten 73. - My address is Ærfuglgaten 73.
Ærlighet varer lengst. - Honesty lasts the longest.
Ærfuglen er den største andearten i vårt land. - The eider is the largest 
duck species in our country.
Ærekrenkelse er en handling som består i å krenke en annens æresfølelse, 
eller opptre på en måte som er egnet til å skade en annens gode navn og 
rykte eller til å utsette ham for hat, ringeakt eller tap av den for hans 
stilling eller næring fornødne tillit. - Defamation is an act that consists of 
violating another person's sense of honour, or behaving in a way that is likely 
to harm another person's good name and reputation or to expose him to hatred, 
contempt, or loss of the trust necessary for his position or business.
Æsene lå i kamp med en annen gudeslekt, vanene. - The Æsir were at war with 
another race of gods, the Vanir.
Ærgjerrighet har vært viktig for mange av oss. Da vi var småjenter, skjønte 
vi at det er viktig å arbeide hardt og bli til noe. - Ambition has been 
important to many of us. When we were little girls, we realized that it is 
important to work hard and become something.
Det var Æsene som var snille. - It was the Æsir who were the kind ones.

Will this suffice?
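
In case it is easier to work with a file, here is a rough sketch that saves a
few of the samples above as plain UTF-8 text, one sentence per line, and counts
the lines containing a capital Æ. The file name and the count_ae_lines helper
are made up for illustration; the real training text may need a different
layout.

```shell
#!/bin/sh
# Hypothetical location for the sample file.
samples=${TMPDIR:-/tmp}/nor_ae_samples.txt

# One Norwegian sentence per line (English translations left out).
cat > "$samples" <<'EOF'
Ærfuglveien 44 er adressen jeg bor på
Min adresse er Ærfuglgaten 73.
Ærlighet varer lengst.
Ærfuglen er den største andearten i vårt land.
Det var Æsene som var snille.
EOF

# Count the lines that contain a capital Æ.
count_ae_lines() {
  grep -c 'Æ' "$1"
}

count_ae_lines "$samples"
```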



torsdag 5. januar 2017 04.39.01 UTC+1 skrev shree følgende:

> Ray is planning to retrain the languages for the new 4.0.0 version 
> sometime in January. So it would be helpful if you could open an issue on 
> https://github.com/tesseract-ocr/langdata/issues with this information.
>
> Also, if you can provide a sample representative Norwegian text including Æ, 
> I will try the finetune training procedure outlined by Ray in 
> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
>
>
> ShreeDevi
> 
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Wed, Jan 4, 2017 at 8:57 PM, Ludvig F Aarstad  > wrote:
>
>> If someone feels up to it, any chance of dumbing down the procedure for 
>> adding in a missing letter in the norwegian language? I am happy tondl the 
>> legwork, just need to understand the concept, and I don't quite understand 
>> it when reading the guides.
>> An easy list containing the steps would do just fine.
>>
>> Something like:
>> 1. Create an image of the letter to add
>> 2. Update wordlist
>> 3. etc etc
>> 4. build something
>> 5. upload to github
>>
>> Or am I simply totally off the track?
>>
>
>



Re: [tesseract-ocr] Re: Tesseract v3.03 and norwegian language

2017-01-04 Thread ShreeDevi Kumar
Ray is planning to retrain the languages for the new 4.0.0 version sometime
in January. So it would be helpful if you could open an issue on
https://github.com/tesseract-ocr/langdata/issues with this information.

Also, if you can provide a sample representative Norwegian text including Æ,
I will try the finetune training procedure outlined by Ray in
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00


ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Wed, Jan 4, 2017 at 8:57 PM, Ludvig F Aarstad wrote:

> If someone feels up to it, any chance of dumbing down the procedure for
> adding in a missing letter in the Norwegian language? I am happy to do the
> legwork, just need to understand the concept, and I don't quite understand
> it when reading the guides.
> An easy list containing the steps would do just fine.
>
> Something like:
> 1. Create an image of the letter to add
> 2. Update wordlist
> 3. etc etc
> 4. build something
> 5. upload to github
>
> Or am I simply totally off the track?
>



[tesseract-ocr] Re: Tesseract v3.03 and norwegian language

2017-01-04 Thread Ludvig F Aarstad
If someone feels up to it, any chance of dumbing down the procedure for adding 
in a missing letter in the Norwegian language? I am happy to do the legwork, 
just need to understand the concept, and I don't quite understand it when 
reading the guides.
An easy list containing the steps would do just fine.

Something like:
1. Create an image of the letter to add
2. Update wordlist
3. etc etc
4. build something
5. upload to github

Or am I simply totally off the track?



[tesseract-ocr] Re: Tesseract v3.03 and norwegian language

2017-01-02 Thread Ludvig F Aarstad

Hm, in Norwegian it isn't that rare. Or at least shouldn't be ;). Æ is the 
uppercase version of æ, and it would never occur in the middle of a word.

I find it strange that it has been left out altogether. What must I do to 
get it in there?


On Tuesday, January 3, 2017 at 00:10:35 UTC+1, Tom Morris wrote:
>
> First, the latest version is 3.04 (although there's also a tag for 3.05).
> Second, there will soon (hopefully) be a release for 4.00 which will make 
> 3.x obsolete.
>
> Having said that, it looks like the root cause of your problem is that 
> Tesseract doesn't know Æ is a possible letter for Norwegian. The training 
> text and the character frequencies have lots of occurrences of the 
> lowercase letter, but none of the uppercase. See these three files:
>
> 
> https://github.com/tesseract-ocr/langdata/blob/master/nor/nor.training_text.unigram_freqs
> https://github.com/tesseract-ocr/langdata/blob/master/nor/nor.training_text
> https://github.com/tesseract-ocr/langdata/blob/master/nor/nor.wordlist
>
> That word list has 360,000 different words without a single one of them 
> containing the character. Just how rare is it? Is it something that would 
> only ever occur in the middle of a word, so you'd need to have some 
> all-caps text to be able to find it?
>
> Note that if 4.0 was trained on the same material as 3.x, it may have the 
> same problem.
>
> Tom
>
> On Monday, January 2, 2017 at 9:42:24 AM UTC-5, Ludvig F Aarstad wrote:
>>
>> Greetings and salutations fellow OCR'ers ;).
>> I have been playing around with various modules in PowerShell for reading 
>> text from an image with PowerShell but I have landed on using tesseract 
>> directly. It all works fine, and it reads like a dream :). However, it 
>> seems it is having problems with at least one of the Norwegian characters. 
>> The scanned image has the letter Æ while tesseract reads it like AE. 
>> I have tried looking into how to train it, but I haven't figured it out 
>> yet.
>> I have based my scripts on the following: 
>> http://blog.jourdant.me/post/powershell-and-tesseract-going-paperless-with-ocr
>>
>> I am grateful for any assistance.
>>
>> Ludvig F. Aarstad
>>
>



[tesseract-ocr] Re: Tesseract v3.03 and norwegian language

2017-01-02 Thread Tom Morris
First, the latest version is 3.04 (although there's also a tag for 3.05).
Second, there will soon (hopefully) be a release for 4.00 which will make 
3.x obsolete.

Having said that, it looks like the root cause of your problem is that 
Tesseract doesn't know Æ is a possible letter for Norwegian. The training 
text and the character frequencies have lots of occurrences of the 
lowercase letter, but none of the uppercase. See these three files:


https://github.com/tesseract-ocr/langdata/blob/master/nor/nor.training_text.unigram_freqs
https://github.com/tesseract-ocr/langdata/blob/master/nor/nor.training_text
https://github.com/tesseract-ocr/langdata/blob/master/nor/nor.wordlist

That word list has 360,000 different words without a single one of them 
containing the character. Just how rare is it? Is it something that would 
only ever occur in the middle of a word, so you'd need to have some 
all-caps text to be able to find it?

Note that if 4.0 was trained on the same material as 3.x, it may have the 
same problem.
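
A quick way to confirm this kind of gap is to count the character directly in 
local copies of the files. The sketch below is illustrative only: the helper 
names count_chars and words_containing and the inline sample lines are made up 
for the example, and it assumes the files are UTF-8 text with one entry per 
line.

```python
from collections import Counter


def count_chars(lines, chars):
    """Count how often each character in `chars` occurs across `lines`."""
    counts = Counter()
    for line in lines:
        for ch in line:
            if ch in chars:
                counts[ch] += 1
    # Report every requested character, including those never seen (count 0).
    return {ch: counts[ch] for ch in chars}


def words_containing(lines, ch):
    """Return every whitespace-separated word that contains `ch`."""
    return [word for line in lines for word in line.split() if ch in word]


# Hypothetical stand-in for a few word-list lines; not taken from nor.wordlist.
sample = ["ære", "æsene", "være", "Ærekrenkelse"]

print(count_chars(sample, "Ææ"))
print(words_containing(sample, "Æ"))
```

To check a real file, pass an open file object instead of the sample list, for 
example count_chars(open("nor.wordlist", encoding="utf-8"), "Ææ"), assuming the 
word list has been downloaded to the current directory. A count of zero for Æ 
alongside a large count for æ would match what is described above.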

Tom

On Monday, January 2, 2017 at 9:42:24 AM UTC-5, Ludvig F Aarstad wrote:
>
> Greetings and salutations fellow OCR'ers ;).
> I have been playing around with various modules in PowerShell for reading 
> text from an image with PowerShell but I have landed on using tesseract 
> directly. It all works fine, and it reads like a dream :). However, it 
> seems it is having problems with at least one of the Norwegian characters. 
> The scanned image has the letter Æ while tesseract reads it like AE. 
> I have tried looking into how to train it, but I haven't figured it out 
> yet.
> I have based my scripts on the following: 
> http://blog.jourdant.me/post/powershell-and-tesseract-going-paperless-with-ocr
>
> I am grateful for any assistance.
>
> Ludvig F. Aarstad
>
