Re: [tesseract-ocr] Re: Tesseract v3.03 and norwegian language

2023-09-15 Thread Des Bw
I have exactly the same problem with Amharic. Three characters are missing, and they are ruining the OCR result. Dear Shree, can you help me please? On Friday, January 6, 2017 at 3:50:38 PM UTC+3 shree wrote: > I have uploaded modified nor.traineddata at > >

Re: [tesseract-ocr] Re: Tesseract v3.03 and norwegian language

2017-01-09 Thread Ludvig F Aarstad
I think I might stick with the postprocessing for now; there are too many oddities I need to learn before I can compile it ;). Still, I think this project is awesome, and I might take it up a notch and try what I am doing now using .NET code :) -- You received this message because you are

Re: [tesseract-ocr] Re: Tesseract v3.03 and norwegian language

2017-01-09 Thread ShreeDevi Kumar
Actually, postprocessing with a replace for AE will be the best bet, as 4.0 is slower than the tesseract engine for Latin-based scripts. You can experiment with 4.0.0alpha. See https://github.com/tesseract-ocr/tesseract/wiki/Compiling You will also need to compile the latest version of leptonica
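The post-processing replace mentioned here could look roughly like the Python sketch below. The thread does not say what the engine emits in place of Æ, so the misread forms ("AE" or "Ae" at the start of a word, followed by a lowercase letter) are an assumption, and `restore_capital_ae` is a hypothetical helper name, not part of Tesseract.

```python
import re

# Assumption: the OCR output contains "AE" or "Ae" where the page had "Æ".
# Only word-initial matches followed by a lowercase letter are rewritten,
# so all-caps abbreviations such as "AEG" are left untouched.
MISREAD_AE = re.compile(r"\b(?:AE|Ae)(?=[a-zæøå])")


def restore_capital_ae(text: str) -> str:
    """Replace the assumed misreadings of a word-initial capital Æ."""
    return MISREAD_AE.sub("Æ", text)


if __name__ == "__main__":
    print(restore_capital_ae("AErø og Aerlig, men ikke AEG"))
```

A rule this blunt will also rewrite loanwords that really do start with "Ae", so in practice it probably needs a small list of exceptions or a check against a Norwegian word list.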

Re: [tesseract-ocr] Re: Tesseract v3.03 and norwegian language

2017-01-08 Thread Ludvig F Aarstad
No worries, I will play around and see what I can get working. For now I am using a simple replace in my script to handle the Æ. How would I go about compiling tesseract 4.0 alpha using git and cmake? The wiki says the 4.0 alpha source code is available in the master branch of the

Re: [tesseract-ocr] Re: Tesseract v3.03 and norwegian language

2017-01-08 Thread ShreeDevi Kumar
Sorry, I am not familiar with PowerShell and NuGet. If you are on Windows, you can try the experimental 4.0.0alpha binaries for gImageReader, a GUI front end to Tesseract-OCR. You can OCR a PDF directly or load multiple images at the same time. On

Re: [tesseract-ocr] Re: Tesseract v3.03 and norwegian language

2017-01-08 Thread Ludvig F Aarstad
Thanks Shree :D. Really appreciate it. Will this work with v3.03 too? I am basing my code on this: https://github.com/jourdant/powershell-paperless and there is a script to initialize the environment that is getting the tesseract files from here: https://nuget.org/api/v2/package/tesseract-ocr.

Re: [tesseract-ocr] Re: Tesseract v3.03 and norwegian language

2017-01-06 Thread ShreeDevi Kumar
I have uploaded modified nor.traineddata at https://github.com/Shreeshrii/tessdata4alpha/blob/master/nor.traineddata See attached log and info file for commands used in training. It took about 9 hours on my pc - about 1700 iterations only and then my PC froze so I rebooted and created the

Re: [tesseract-ocr] Re: Tesseract v3.03 and norwegian language

2017-01-06 Thread Peter
On Thursday, 5 January 2017 at 04:39:01 UTC+1, shree wrote: > Ray is planning to retrain the languages for the new 4.0.0 version sometime in January. So it would be helpful if you could open an issue on https://github.com/tesseract-ocr/langdata/issues with this information. Is it

Re: [tesseract-ocr] Re: Tesseract v3.03 and norwegian language

2017-01-05 Thread ShreeDevi Kumar
Tried 'Finetune'; that does not help with adding a character. Trying 'Add a layer' now. ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Thu, Jan 5, 2017 at 8:59 PM, Ludvig F Aarstad wrote: >

Re: [tesseract-ocr] Re: Tesseract v3.03 and norwegian language

2017-01-05 Thread Ludvig F Aarstad
Fantastic, thanks :).

Re: [tesseract-ocr] Re: Tesseract v3.03 and norwegian language

2017-01-05 Thread ShreeDevi Kumar
I will give it a try and let you know.

Re: [tesseract-ocr] Re: Tesseract v3.03 and norwegian language

2017-01-05 Thread Ludvig F Aarstad
I can come up with several samples, if that helps. I also realized that the occurrence of Æ in the beginning of a sentence is quite rare. It will in most cases only be for names of people (surnames mostly) and names of places and streets in addition to some specific Norwegian words that can

Re: [tesseract-ocr] Re: Tesseract v3.03 and norwegian language

2017-01-04 Thread ShreeDevi Kumar
Ray is planning to retrain the languages for the new 4.0.0 version sometime in January. So it would be helpful if you could open an issue on https://github.com/tesseract-ocr/langdata/issues with this information. Also, if you can provide a sample representative Norwegian text including Æ, I will

[tesseract-ocr] Re: Tesseract v3.03 and norwegian language

2017-01-04 Thread Ludvig F Aarstad
If someone feels up to it, any chance of dumbing down the procedure for adding a missing letter to the Norwegian language? I am happy to do the legwork, I just need to understand the concept, and I don't quite understand it when reading the guides. An easy list containing the steps would do

[tesseract-ocr] Re: Tesseract v3.03 and norwegian language

2017-01-02 Thread Ludvig F Aarstad
Hm, in Norwegian it isn't that rare. Or at least shouldn't be ;). Æ is the uppercase version of æ, and it would never occur in the middle of a word. I find it strange that it has been left out altogether. What must I do to get it in there? Tuesday, 3 January 2017

[tesseract-ocr] Re: Tesseract v3.03 and norwegian language

2017-01-02 Thread Tom Morris
First, the latest version is 3.04 (although there's also a tag for 3.05). Second, there will soon (hopefully) be a release for 4.00 which will make 3.x obsolete. Having said that, it looks like the root cause of your problem is that Tesseract doesn't know Æ is a possible letter for Norwegian.
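One way to confirm the diagnosis that a letter is unknown to a language is to look at the character set stored with its trained data. The Python sketch below assumes the unicharset has already been extracted to a text file (for instance with `combine_tessdata -u`), and that the file follows the usual layout: an entry count on the first line, then one entry per line beginning with the symbol. The function names are hypothetical helpers, not Tesseract APIs.

```python
from pathlib import Path


def load_unicharset(path: str) -> set[str]:
    """Return the symbols listed in an extracted unicharset file.

    Assumed layout: line 1 is the entry count; every later line starts
    with the symbol, followed by space-separated properties.
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return {line.split(" ", 1)[0] for line in lines[1:] if line}


def missing_characters(sample_text: str, charset: set[str]) -> list[str]:
    """List non-whitespace characters in the sample that the charset lacks."""
    return sorted({ch for ch in sample_text
                   if not ch.isspace() and ch not in charset})
```

Running `missing_characters` over a representative Norwegian sample would list Æ if the character set really lacks it, which gives a concrete item to include when reporting the problem.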