[tesseract-ocr] Re: Gaelic tesseract-ocr

2022-11-08 Thread Peter Flynn
I believe there is now an Irish-language module. 
I am starting to look at training tesseract to recognise cló-gaelach type 
like this sign 

Has anyone ever done this before? 

On Friday, April 4, 2008 at 5:52:02 PM UTC+1 davit wrote:

> Before embarking on developing OCR solution for Gaelic and Irish
> language text and font, I wonder if anybody had attempted anything
> like this?

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/6d4dc0e3-b385-4210-af72-5abda8d4966dn%40googlegroups.com.


Re: [tesseract-ocr] Training Tesseract 5 with known data in tables

2022-05-22 Thread Peter Vallsten
Hi zdenop!
Thanks for your reply! Great to know that it could be done with easier 
methods, I've never quite worked with python before though!
So if I understand you correctly you mean that I could create a script that 
runs "pyautogui.locateOnScreen" for each possible name, team and position 
and when two y-values match (within a few pixels +/-) I have the position 
for that player and then create my output based on that?
It would be easier to just have to use OCR on the numbers but I'm not sure 
though how I would use that previous information to connect the time to 
each driver.
Do you have an example of the quick test you ran and how you did it?
Any suggestions on how I can read the numbers correctly and connect them to 
the names?
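
The row-matching idea described above (two matches belong to the same driver when their y-values agree within a few pixels) could be sketched roughly as follows. This is only an illustration: the (label, x, y) tuples and their coordinates are invented, and in practice they would have to come from something like pyautogui.locateOnScreen, which is not called here.

```python
from typing import List, Tuple

Match = Tuple[str, int, int]  # (label, x, y), with y the top edge in pixels


def group_rows(matches: List[Match], tolerance: int = 5) -> List[List[str]]:
    """Group matches whose y-values lie within `tolerance` pixels of each other."""
    rows: List[List[Match]] = []
    for match in sorted(matches, key=lambda m: m[2]):
        # Compare against the first item of the current row to limit drift.
        if rows and abs(match[2] - rows[-1][0][2]) <= tolerance:
            rows[-1].append(match)
        else:
            rows.append([match])
    # Within each row, order the labels left to right by x.
    return [[label for label, _, _ in sorted(row, key=lambda m: m[1])] for row in rows]


# Hypothetical coordinates, for illustration only.
matches = [
    ("Mercedes-AMG Petronas", 620, 301),
    ("Lewis HAMILTON", 380, 300),
    ("Red Bull", 620, 342),
    ("Max VERSTAPPEN", 380, 340),
]
rows = group_rows(matches)
print(rows)
```

Sorting by y first means the order of the resulting rows is also the finishing order on screen, so the position number would not need to be read by OCR at all.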

When I'm running OCR with command: 'tesseract C:\f1_grayscale.png test.txt 
--psm 6' it gives the output:

= hs ~Xi2, =e 3
= ~~ y - Nae
—-ORNUL
1 MONACO GRAND PRIX - SHORT QUALIFYING
a POS. DRIVER TEAM TYRE BEST GAP
Ly Advance
> 1 BK Lewis HAMILTON I Mercedes-AMG Petronas W) 1:26.147 -
Lad
is 2 im Max VERSTAPPEN Red Bull ) 1:26.383 +0.236 Amr
SS WI Race Director = ~
’ t 3 gee ValtteriBOTTAS I Mercedes-AMG Petronas W) 1:26.431 +0.284 es
Pr: ak a = = : --
” —_ : 4 5 Sergio PEREZ Red Bull WN) 1:26.538 +0.391 =——.
= Restart Session « = x 2a
3 fgg Charles LECLERC | Ferrari (w) 1:26.981 +0.834 ES te
c= . NS
Li S 6 IK Lando NorRIS I McLaren Ww) 1:27.274 +1.127 = eee
7 ——n = «67 ~—s) Daniel RICCIARDO I McLaren w) 1:27.387 +1.240 —-
: : — . Tied
_ / oN 8 Em Carlos SAINZ ! Ferrari I) 1:27.390 +1.243 C) i
= a 9 Bl) Pierre GASLY AlphaTauri iC) 1:27.427 +1.280 _ 2
maa — x 4 a
-S 10 [EM Fernando ALONSO I Alpine (w) 1:27.662 +1515 “Gane
4 ais De * Te
A : aa Yuki TSUNODA AlphaTauri w) 1:27.812 +1.665 eS
r wv ft . / 12 |) Esteban OCON I Alpine Ww) 1:27.877 +1.720 ty atone
La Le =
* } a al 153 Ql Sebastian VETTEL ! Aston Martin cD) 1:27.966 +1.819 an. sd
Se Be ae | 14 = Lance STROLL | Aston Martin wi) 1:28.119 +1.972 a f
Det oeee ly ee | LAY at Lae
; * Ves ‘om Sat F Saf = — a _— ee" pe
ea a i = : : = > Sas —
| ey Verstappen... >i Se
| y —, (X)*SELECT
bia Whe. = : por . sa ; 4 .
7 | a | = | < jee )
\ | es : i} es 8 | —— ee ema Rey
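
Even with the noise in the output above, the lap and gap times follow the fixed formats d:dd.ddd and +d.ddd, so one possible way to connect them to the names is to search each line for those patterns and for a known surname. The sketch below assumes a hand-written surname list (only a subset is shown) and uses the gap as a consistency check against the leader's time; the function name and return shape are made up for this example.

```python
import re

# Lap time "d:dd.ddd" followed by a gap "+d.ddd"/"+dd.ddd", or "-" for the leader.
TIMES = re.compile(r"(\d):(\d{2})\.(\d{3})\s+(\+\d{1,2}\.\d{3}|-)")

# Known surnames as they appear in the screenshot (subset, for illustration).
SURNAMES = ["HAMILTON", "VERSTAPPEN", "BOTTAS", "PEREZ"]


def parse_line(line):
    """Return (surname, lap_ms, gap_ms), or None if the line is not a result row."""
    times = TIMES.search(line)
    if times is None:
        return None
    upper = line.upper()
    surname = next((s for s in SURNAMES if s in upper), None)
    if surname is None:
        return None
    minutes, seconds, millis, gap = times.groups()
    lap_ms = (int(minutes) * 60 + int(seconds)) * 1000 + int(millis)
    gap_ms = 0 if gap == "-" else round(float(gap) * 1000)
    return surname, lap_ms, gap_ms


# Lines copied from the OCR output above.
lines = [
    "> 1 BK Lewis HAMILTON I Mercedes-AMG Petronas W) 1:26.147 -",
    "is 2 im Max VERSTAPPEN Red Bull ) 1:26.383 +0.236 Amr",
    "SS WI Race Director = ~",
    "t 3 gee ValtteriBOTTAS I Mercedes-AMG Petronas W) 1:26.431 +0.284 es",
]
rows = [row for row in map(parse_line, lines) if row is not None]
leader_ms = rows[0][1]
for surname, lap_ms, gap_ms in rows:
    # A lap time should equal the leader's time plus the gap.
    status = "ok" if lap_ms == leader_ms + gap_ms else "CHECK"
    print(surname, lap_ms, gap_ms, status)
```

A row where the gap was misread, such as the Fernando ALONSO line above where it came out as +1515, does not match the pattern and is skipped, so it would need a fallback (for example, recomputing the gap from the lap time).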
On Sunday, 22 May 2022 at 19:46:15 UTC+2, zdenop wrote:

>
> I think you made it too complicated... IMO no (re)training is needed. 
>
> If you are working with images where you know text location you have 
> solved one big problem already.
> Working with a limited number of known text strings (players' names, 
> teams' names) gives you other (and IMHO faster) options than OCR.  I would 
> use python and pyautogui.locateOnScreen[1]. It returns the position of 
> the text in the screenshot, so you can sort the matches and work out each 
> position in the race. Of course, you will still need OCR for the best time 
> and maybe the GAP (which you can use to check the OCR quality).
>
> Another solution would be:
>
> 1. Open the screenshot as grayscale.
> 2. Invert it (so there are dark letters on a white background).
> 3. Threshold the image (convert it to black and white).
> 4. OCR each "cell" separately.
>
> I made a quick test and some times are not recognized correctly (e.g. there 
> is a missing ":" in the time for Valtteri BOTTAS), but I think this could be 
> solved in python by post-processing the OCR result together with the GAP 
> time. Or maybe better image preprocessing could solve it too, as I see JPEG 
> artifacts on the thresholded image.
>
> [1] https://pyautogui.readthedocs.io/en/latest/screenshot.html
>
> Zdenko
>
>
> On Sun, 22 May 2022 at 19:04, Peter Vallsten wrote:
>
>> Hi!
>> I'm trying to get started with Tesseract and OCR to make my life a bit 
>> easier. I'll try to be as descriptive as possible.
>>
>> *Basically what I'm trying to do:*
>> Me and my friends are playing F1 together over Ps5 and I have google 
>> sheets with all the stats from our races. Link to document: F1 Google 
>> Sheets stats 
>> <https://docs.google.com/spreadsheets/d/1vrQBdEDkv6dfKxCO8dtTT5qy1lSW92kXdcSByAskQOA/edit?usp=sharing>
>> Right now I'm typing in all the data myself, which is super tedious and 
>> time-consuming. I want to load a screenshot into tesseract and get the data 
>> ready to copy-paste into the document and make it more automatic. (Example 
>> in the bottom of this post)
>>
>> *What I want to do:*
>> I want to parse the data from the screenshots, all the data is already 
>> known and the screenshots will be in clear 1080p pictures. I know the name 
>> of all the drivers and teams and the lap times are in the format: d:dd.ddd 
>> and the gap times are in the format: +d.ddd (possible: +dd.ddd)
>> d = integer
>> I want the output of every position 1-20, name of the driver, team, lap 
>> time & gap time to le

RE: [tesseract-ocr] Running Tesseract 5 on Linux

2022-04-03 Thread 'Peter Kronenberg' via tesseract-ocr
I’ve been using it for over a year, since version 5 was still in alpha, 
and I’ve never been able to figure out how to get it on Linux.  I’ve 
installed it successfully on Windows, but all the package managers seem to 
offer only version 4.1.1.  Currently we use Docker with the openjdk:8 image, 
although we’d be willing to switch if Tesseract 5 were available on another 
Linux distribution.

Peter Kronenberg  |  Senior AI Analytic ENGINEER
C: 703.887.5623
[Torch AI]<http://www.torch.ai/>
5250 W 116th Pl, Suite 200., Leawood, KS 66211
WWW.TORCH.AI<http://www.torch.ai/>


From: tesseract-ocr@googlegroups.com  On Behalf 
Of Zdenko Podobny
Sent: Sunday, April 3, 2022 12:50 PM
To: tesseract-ocr@googlegroups.com
Subject: Re: [tesseract-ocr] Running Tesseract 5 on Linux


By "Linux", do you mean a Linux distribution? Which one do you use?
+ I expect that before posting to the forum you read the docs: 
https://github.com/tesseract-ocr/tessdoc/blob/main/Installation.md
What did you try, and what problem did you face?

Zdenko


On Sun, 3 Apr 2022 at 18:38, 'Peter Kronenberg' via tesseract-ocr wrote:
Has anyone had any luck installing Tesseract 5 on Linux?  It doesn’t seem to be 
available in any of the package managers



RE: [tesseract-ocr] Running Tesseract 5 on Linux

2022-04-03 Thread 'Peter Kronenberg' via tesseract-ocr
I’m not a Linux guru, but usually, I just do an apt-get install tesseract-ocr, 
and it seems to install version 4.1.1 by default.  Not sure if anything else 
needs to be done to get version 5
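
When a package manager silently installs an older release, a small check like the one below can make the mismatch visible, for example during a Docker build. It is a sketch only: it assumes the first line of `tesseract --version` has the form "tesseract 4.1.1" (optionally with a "v" prefix), and the sample string is illustrative rather than captured from a real run.

```python
import re
import subprocess


def parse_version(text):
    """Extract (major, minor, patch) from `tesseract --version` style output."""
    match = re.search(r"tesseract\s+v?(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def installed_version(binary="tesseract"):
    """Run the binary and parse its version; returns None if it is not installed."""
    try:
        result = subprocess.run([binary, "--version"], capture_output=True, text=True)
    except FileNotFoundError:
        return None
    # Some builds print the version to stderr, so look at both streams.
    return parse_version(result.stdout + result.stderr)


# Illustrative sample of the assumed output format.
sample = "tesseract 4.1.1\n leptonica-1.79.0\n"
print(parse_version(sample))
```

Comparing the result of `installed_version()` with `(5, 0, 0)` would then tell whether the image being built actually contains version 5.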



From: tesseract-ocr@googlegroups.com  On Behalf 
Of Shree Devi Kumar
Sent: Sunday, April 3, 2022 12:51 PM
To: tesseract-ocr 
Subject: Re: [tesseract-ocr] Running Tesseract 5 on Linux


Have you tried the instructions at
https://tesseract-ocr.github.io/tessdoc/Installation.html

On Sun, Apr 3, 2022, 22:08 'Peter Kronenberg' via tesseract-ocr wrote:
Has anyone had any luck installing Tesseract 5 on Linux?  It doesn’t seem to be 
available in any of the package managers



[tesseract-ocr] Running Tesseract 5 on Linux

2022-04-03 Thread 'Peter Kronenberg' via tesseract-ocr
Has anyone had any luck installing Tesseract 5 on Linux?  It doesn't seem to be 
available in any of the package managers



Re: [tesseract-ocr] Re: Using Tesseract for Handwriting..

2021-11-21 Thread Peter Geraghty
Thank you again for your help this far. Does the RNN implemented in 
tesseract 4.0 use a connectionist temporal classification for outputs?
I seem to have difficulty navigating through the repo. Most of the 
documentation just refers to it as an RNN.

Thanks!



On Sunday, November 21, 2021 at 8:16:31 PM UTC-6 shree wrote:

> Also see the Technical Information section in 
>
> https://tesseract-ocr.github.io/tessdoc/
>
> On Mon, Nov 22, 2021, 01:36 Peter Geraghty  wrote:
>
>> Thank you!!! will do!
>>
>> On Sunday, November 21, 2021 at 12:51:51 AM UTC-6 shree wrote:
>>
>>> Please see https://github.com/tesseract-ocr/tesstrain/wiki for detailed 
>>> examples of tesseract training for handwritten texts.
>>>
>>> On Sat, Nov 20, 2021 at 11:53 AM Peter Geraghty  
>>> wrote:
>>>
>>>> sorry, by word recognition, I meant word and character localization.
>>>>
>>>> On Friday, November 19, 2021 at 11:04:38 PM UTC-6 Peter Geraghty wrote:
>>>>
>>>>> Hi everyone!
>>>>>
>>>>> Recently started a project attempting to use Tesseract for handwriting 
>>>>> recognition. Anyone's thoughts or inputs would be greatly appreciated. 
>>>>>
>>>>> Also, would contributions to this project aimed at recognizing 
>>>>> handwriting be a welcome addition? 
>>>>>
>>>>> Also, I've done some research on the components of tesseract and 
>>>>> wanted to be sure that I understood which legacy components have been 
>>>>> replaced:
>>>>>
>>>>> - The Polygon approximation algorithm has been replaced by a recurrent 
>>>>> Neural Network (character recognition)
>>>>> -  A line selector has been replaced by a neural network of some kind.
>>>>>
>>>>> For the Tesseract we're using, the NN has been replaced by a custom 
>>>>> trained model. However, my understanding is that word and character 
>>>>> recognition are still using the older algorithm.
>>>>>
>>>>> Thanks for any input,
>>>>> Peter G.
>>>>>


Re: [tesseract-ocr] Re: Using Tesseract for Handwriting..

2021-11-21 Thread Peter Geraghty
Thank you!!! will do!

On Sunday, November 21, 2021 at 12:51:51 AM UTC-6 shree wrote:

> Please see https://github.com/tesseract-ocr/tesstrain/wiki for detailed 
> examples of tesseract training for handwritten texts.
>
> On Sat, Nov 20, 2021 at 11:53 AM Peter Geraghty  
> wrote:
>
>> sorry, by word recognition, I meant word and character localization.
>>
>> On Friday, November 19, 2021 at 11:04:38 PM UTC-6 Peter Geraghty wrote:
>>
>>> Hi everyone!
>>>
>>> Recently started a project attempting to use Tesseract for handwriting 
>>> recognition. Anyone's thoughts or inputs would be greatly appreciated. 
>>>
>>> Also, would contributions to this project aimed at recognizing 
>>> handwriting be a welcome addition? 
>>>
>>> Also, I've done some research on the components of tesseract and wanted 
>>> to be sure that I understood which legacy components have been replaced:
>>>
>>> - The Polygon approximation algorithm has been replaced by a recurrent 
>>> Neural Network (character recognition)
>>> -  A line selector has been replaced by a neural network of some kind.
>>>
>>> For the Tesseract we're using, the NN has been replaced by a custom 
>>> trained model. However, my understanding is that word and character 
>>> recognition are still using the older algorithm.
>>>
>>> Thanks for any input,
>>> Peter G.
>>>


[tesseract-ocr] Re: Using Tesseract for Handwriting..

2021-11-19 Thread Peter Geraghty
sorry, by word recognition, I meant word and character localization.

On Friday, November 19, 2021 at 11:04:38 PM UTC-6 Peter Geraghty wrote:

> Hi everyone!
>
> Recently started a project attempting to use Tesseract for handwriting 
> recognition. Anyone's thoughts or inputs would be greatly appreciated. 
>
> Also, would contributions to this project aimed at recognizing handwriting 
> be a welcome addition? 
>
> Also, I've done some research on the components of tesseract and wanted to 
> be sure that I understood which legacy components have been replaced:
>
> - The Polygon approximation algorithm has been replaced by a recurrent 
> Neural Network (character recognition)
> -  A line selector has been replaced by a neural network of some kind.
>
> For the Tesseract we're using, the NN has been replaced by a custom 
> trained model. However, my understanding is that word and character 
> recognition are still using the older algorithm.
>
> Thanks for any input,
> Peter G.
>
>



[tesseract-ocr] Using Tesseract for Handwriting..

2021-11-19 Thread Peter Geraghty
Hi everyone!

Recently started a project attempting to use Tesseract for handwriting 
recognition. Anyone's thoughts or inputs would be greatly appreciated. 

Also, would contributions to this project aimed at recognizing handwriting 
be a welcome addition? 

Also, I've done some research on the components of tesseract and wanted to 
be sure that I understood which legacy components have been replaced:

- The Polygon approximation algorithm has been replaced by a recurrent 
Neural Network (character recognition)
-  A line selector has been replaced by a neural network of some kind.

For the Tesseract we're using, the NN has been replaced by a custom trained 
model. However, my understanding is that word and character recognition are 
still using the older algorithm.

Thanks for any input,
Peter G.



[tesseract-ocr] RE: Specify Script instead of Language

2021-01-28 Thread Peter Kronenberg
Or is the intent that the user has to move the files in the script directory up 
a level?  Is this documented anywhere?

From: Peter Kronenberg
Sent: Thursday, January 28, 2021 1:31 PM
To: tesseract-ocr@googlegroups.com
Subject: Specify Script instead of Language


This page, 
https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#LANGUAGES,
 implies that the -l option accepts the name of a language or script.  I 
assumed it would look in tessdata first and if not found, would look in 
tessdata/script.  But it seems you have to enter the path.  Is this the 
expected behavior?

For example:

[cid:image001.png@01D6F579.F3CC0390]



[tesseract-ocr] Specify Script instead of Language

2021-01-28 Thread Peter Kronenberg

This page, 
https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#LANGUAGES,
 implies that the -l option accepts the name of a language or script.  I 
assumed it would look in tessdata first and if not found, would look in 
tessdata/script.  But it seems you have to enter the path.  Is this the 
expected behavior?

For example:

[cid:image001.png@01D6F579.BEA09050]
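
The lookup order the post expected (tessdata first, then tessdata/script) could be done on the caller's side before invoking Tesseract. The sketch below is a hypothetical helper, not part of Tesseract; it assumes the value passed to -l for a script model is a path relative to the tessdata directory such as script/Latin, which is a guess because the example image in the post is not preserved.

```python
from pathlib import Path
import tempfile


def resolve_lang(name, tessdata_dir):
    """Return the value to pass to -l, or None if no matching model is found.

    Looks for <name>.traineddata in tessdata_dir first, then in its
    script/ subdirectory, mirroring the lookup order the post expected.
    """
    root = Path(tessdata_dir)
    if (root / f"{name}.traineddata").is_file():
        return name
    if (root / "script" / f"{name}.traineddata").is_file():
        return f"script/{name}"
    return None


# Build a throwaway directory layout to demonstrate the lookup.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "script").mkdir()
    (root / "eng.traineddata").touch()
    (root / "script" / "Latin.traineddata").touch()
    results = [resolve_lang(n, root) for n in ("eng", "Latin", "Cyrillic")]

print(results)
```

This keeps the script files where they are, so nothing has to be moved up a level.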



RE: {EXTERNAL}[tesseract-ocr] Installing tessdata

2021-01-28 Thread Peter Kronenberg
Thanks for those links.  I think what I’m looking for is a more practical 
understanding of some of the differences, instead of technical details, which, 
not being a domain expert, I don’t fully understand.

For instance, I understand that there are 2 types of models, the LSTM OCR 
engine and the legacy engine.  What is the practical difference between the 
two?  In other words, if I go with the ‘best’ or ‘fast’ models, which only do 
LSTM OCR, what am I missing out on by not having legacy?  Is there any reason I 
would stick with the legacy models at https://github.com/tesseract-ocr/tessdata

As for the difference between ‘fast’ and ‘best’, is there any quantitative 
difference that someone can point me to?  In other words, how much better is 
‘best’ and how much more time does it take?  I guess I’m trying to decide the 
best one (no pun intended) for my application.

For the scripts, I haven’t found much definitive documentation on those.  If I 
use a Script language, is that equivalent to just specifying all the languages 
that use that script?  Is there any downside?  Do all the scripts contain 
English?   For example, if the language I’m dealing with is German, could I 
just specify Latin?  Or would it be more accurate to specify ‘deu’.  For 
something like Arabic, if I specified a script of Arabic, would that include 
Arabic, Farsi and other similar languages that use the same alphabet?  Would it 
be just as accurate as specifying the specific language?  And does the Arabic 
script contain English as well, so it could handle a mixed document?

Thank you
Peter

From: tesseract-ocr@googlegroups.com  On Behalf 
Of Shree Devi Kumar
Sent: Wednesday, January 27, 2021 8:41 PM
To: tesseract-ocr 
Subject: Re: {EXTERNAL}[tesseract-ocr] Installing tessdata

Please see

https://tesseract-ocr.github.io/tessdoc/Data-Files.html

Also the readme files in the three repos

https://github.com/tesseract-ocr/tessdata_fast


On Thu, Jan 28, 2021, 03:20 Peter Kronenberg wrote:
Hi, can someone help with these questions?  Just trying to understand better 
how the language models are used and what is the difference between them.

Thanks
Peter

From: tesseract-ocr@googlegroups.com On Behalf Of Peter Kronenberg
Sent: Thursday, January 21, 2021 12:59 PM
To: tesseract-ocr@googlegroups.com
Subject: {EXTERNAL}[tesseract-ocr] Installing tessdata

This email was sent from outside your organisation, yet is displaying the name 
of someone from your organisation. This often happens in phishing attempts. 
Please only interact with this email if you know its source and that the 
content is safe.

CAUTION: This email originated from outside of the organization. DO NOT click 
links or open attachments unless you recognize the sender and know the content 
is safe.
I see that the default tessdata just has English and OSD.  I see all the other 
data at https://github.com/tesseract-ocr/tessdata.  Do I just copy those to the 
same tessdata directory?  The repo has a much larger version of eng.traineddata 
than what comes with Tesseract.  Can I just replace it?
And what is the difference of the ones in the script directory?

In the directory from the initial install, not only do I have eng.traineddata, 
but there is also user-patterns, user-words and other files.  Do those files 
exist for the other languages as well?
RE: {EXTERNAL}[tesseract-ocr] Installing tessdata

2021-01-27 Thread Peter Kronenberg
Hi, can someone help with these questions?  Just trying to understand better 
how the language models are used and what is the difference between them.

Thanks
Peter

From: tesseract-ocr@googlegroups.com  On Behalf 
Of Peter Kronenberg
Sent: Thursday, January 21, 2021 12:59 PM
To: tesseract-ocr@googlegroups.com
Subject: {EXTERNAL}[tesseract-ocr] Installing tessdata

I see that the default tessdata just has English and OSD.  I see all the other 
data at https://github.com/tesseract-ocr/tessdata.  Do I just copy those to the 
same tessdata directory?  The repo has a much larger version of eng.traineddata 
than what comes with Tesseract.  Can I just replace it?
And what is the difference of the ones in the script directory?

In the directory from the initial install, not only do I have eng.traineddata, 
but there is also user-patterns, user-words and other files.  Do those files 
exist for the other languages as well?


[tesseract-ocr] Installing tessdata

2021-01-21 Thread Peter Kronenberg
I see that the default tessdata just has English and OSD.  I see all the other 
data at https://github.com/tesseract-ocr/tessdata.  Do I just copy those to the 
same tessdata directory?  The repo has a much larger version of eng.traineddata 
than what comes with Tesseract.  Can I just replace it?
And how do the ones in the script directory differ?

In the directory from the initial install, not only do I have eng.traineddata, 
but there is also user-patterns, user-words and other files.  Do those files 
exist for the other languages as well?
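
For anyone following along, here is a minimal sketch of the "copy it into the same tessdata directory" step asked about above. It assumes the usual layout in which each language is one `<lang>.traineddata` file in that directory and is selected with `-l <lang>`. The function names and paths are illustrative only, not part of Tesseract.

```python
import shutil
from pathlib import Path


def install_traineddata(model_file, tessdata_dir):
    """Copy one .traineddata file into tessdata_dir, replacing any existing copy."""
    model_file = Path(model_file)
    if model_file.suffix != ".traineddata":
        raise ValueError(f"not a traineddata file: {model_file}")
    target_dir = Path(tessdata_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    # copy2 overwrites an existing file of the same name (e.g. eng.traineddata)
    return Path(shutil.copy2(model_file, target_dir / model_file.name))


def installed_languages(tessdata_dir):
    """Return the codes usable with -l: one per *.traineddata file in the directory."""
    return sorted(p.stem for p in Path(tessdata_dir).glob("*.traineddata"))
```

Copying a larger `eng.traineddata` over the installed one goes through the same function; keeping a backup of the original file first is a cheap precaution.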

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/MN2PR20MB268642993B65C83511CFAF88E7A19%40MN2PR20MB2686.namprd20.prod.outlook.com.


RE: [tesseract-ocr] Tesseract v 5.0 on Linux

2020-12-31 Thread Peter Kronenberg
How about getting version 4.1.1?  The version I get using apk add is 4.0.0

From: Peter Kronenberg
Sent: Thursday, December 31, 2020 2:56 PM
To: tesseract-ocr@googlegroups.com
Subject: RE: [tesseract-ocr] Tesseract v 5.0 on Linux

My understanding was that it was stable for end users and that the only reason 
it was still beta was because of API changes.  Is that not correct?
Any estimate on when it will be released?

Thank you

From: tesseract-ocr@googlegroups.com On Behalf Of Zdenko Podobny
Sent: Thursday, December 31, 2020 2:32 PM
To: tesseract-ocr@googlegroups.com
Subject: Re: [tesseract-ocr] Tesseract v 5.0 on Linux

Version 5 is not officially released and there are plenty of code changes 
(improvements) - API is not ready/finalized.
So the answer is: nowhere. You have to build it by yourself.

Zdenko


On Thu, 31 Dec 2020 at 20:29, Peter Kronenberg 
<peter.kronenb...@torch.ai> wrote:

Is there a way to get Tesseract 5.0 on Linux without building it myself?  I'm 
on Alpine Linux.  apk add only gets me 4.0

thanks
Peter

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/MN2PR20MB2686027A000D279784E7551EE7D60%40MN2PR20MB2686.namprd20.prod.outlook.com.


RE: [tesseract-ocr] Tesseract v 5.0 on Linux

2020-12-31 Thread Peter Kronenberg
My understanding was that it was stable for end users and that the only reason 
it was still beta was because of API changes.  Is that not correct?
Any estimate on when it will be released?

Thank you

From: tesseract-ocr@googlegroups.com  On Behalf 
Of Zdenko Podobny
Sent: Thursday, December 31, 2020 2:32 PM
To: tesseract-ocr@googlegroups.com
Subject: Re: [tesseract-ocr] Tesseract v 5.0 on Linux

Version 5 is not officially released and there are plenty of code changes 
(improvements) - API is not ready/finalized.
So the answer is: nowhere. You have to build it by yourself.

Zdenko


On Thu, 31 Dec 2020 at 20:29, Peter Kronenberg 
<peter.kronenb...@torch.ai> wrote:

Is there a way to get Tesseract 5.0 on Linux without building it myself?  I'm 
on Alpine Linux.  apk add only gets me 4.0

thanks
Peter

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/MN2PR20MB2686FD5656731892A12C3EDFE7D60%40MN2PR20MB2686.namprd20.prod.outlook.com.


[tesseract-ocr] Tesseract v 5.0 on Linux

2020-12-31 Thread Peter Kronenberg

Is there a way to get Tesseract 5.0 on Linux without building it myself?  
I'm on Alpine Linux.  apk add only gets me 4.0

thanks
Peter

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/1b452bab-8697-4330-8333-51203fcfcae2n%40googlegroups.com.


Re: [tesseract-ocr] Unable to correctly OCR the image

2020-11-27 Thread Peter Jain
Hi,
Try histogram equalization to remove the black spots from the
background. If the image is not already at 300 dpi, increase it to that,
then enlarge the image to 4 times its size and try again.

thanks
-Peter
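
A rough sketch of the two steps suggested above (histogram equalization, then 4x enlargement), written in plain Python on a grayscale image held as rows of 0-255 integers. This is only an illustration of the technique; the function names are made up here, and in practice an imaging library with smoother interpolation than nearest-neighbour would be the more usual choice.

```python
def equalize_histogram(pixels):
    """Histogram-equalize an 8-bit grayscale image given as rows of ints (0-255)."""
    flat = [v for row in pixels for v in row]
    total = len(flat)
    hist = [0] * 256
    for v in flat:
        hist[v] += 1
    # Cumulative count of pixels at or below each grey level.
    cdf = []
    running = 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next((c for c in cdf if c > 0), 0)
    if total == cdf_min:  # empty or single-valued image: nothing to stretch
        return [list(row) for row in pixels]
    # Map each grey level so the used range is spread over 0..255.
    lut = [max(0, (c - cdf_min) * 255 // (total - cdf_min)) for c in cdf]
    return [[lut[v] for v in row] for row in pixels]


def upscale_nearest(pixels, factor=4):
    """Enlarge the image by an integer factor using nearest-neighbour repetition."""
    out = []
    for row in pixels:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out
```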

On Fri, Nov 27, 2020 at 12:30 AM neha sharma  wrote:

> Hi,
>
> I am trying to OCR the image, but unable to get the correct result. I
> tried few image enhancement techniques but failed. Can anyone help me with
> this. I have attached the image.
>  [image: croppedCard - Copy.png][image: croppedCard.png]

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAEjs4B09ROHtWz5s-MVALLPdZw%2BiqRjZWgpMmFsU%2BUZrSMQFwA%40mail.gmail.com.


[tesseract-ocr] Tesseract box file not recognizing some character/word.

2020-03-10 Thread Peter


Hi,


When I create box files with Tesseract and view them in the editor, there 
are cases (see the attached picture) where some of the words don’t have 
bounding boxes. Is this because Tesseract cannot interpret the character/word?


[image: sample.PNG]


Currently using tesseract v5.0.0-alpha


Thanks,

Peter

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/c68abd76-4b24-41a1-bfba-115a7a80ff81%40googlegroups.com.


[tesseract-ocr] How to efficiently extend the training_text file?

2019-10-10 Thread peter bence
I'm working with the Arabic `langdata_lstm`, whose `training_text` file has 
only 84 lines of training text, which I believe is too little for 
building/training a reliable model. Reading the `training_text` file, I see 
randomly generated text with no meaning. At first I thought this was an 
Arabic-specific problem, but later I found that it is the same for all other 
languages.

*My questions are:*

1. What specifications are followed when generating these `training_text` 
files? (For example, I can see that no line is longer than 60 characters; 
is this one of the specifications?)

2. Could I simply extend the `training_text` file, then generate my training 
data with custom fonts and start training directly? Or are there other 
files that must be changed after changing this file? If so, which are 
they and how do I regenerate them?

Best Regards

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/f40d972a-50d8-4a17-b69c-3f83271b3af8%40googlegroups.com.


[tesseract-ocr] Block detection in document header

2018-08-12 Thread Peter
Hi everyone,

Does any of you know a way to make tesseract treat large horizontal 
distances as separators between blocks?

Consider the attached document (it's just a random example from the web; 
tesseract shows the same behavior on similar documents). Tesseract 
consistently fails to recognize the two separate blocks in the header and 
instead reads the words line by line.

The output then looks like this:
COUR EUROPEENNE EUROPEAN COURT
des of
DROITS DE L’HOMME HUMAN RIGHTS

Where it should clearly look like this:
COUR EUROPEENNE 
des 
DROITS DE L’HOMME 

EUROPEAN COURT
of
HUMAN RIGHTS


Looking at the blocks, it becomes clear that tesseract does not recognize 
the two header blocks as separate, even though they are clearly 
distinguishable.
Is there a way to tweak tesseract's block/paragraph detection to be more 
sensitive to this and correctly separate the header blocks?
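
One possible workaround, sketched here as post-processing rather than as a change to Tesseract's own layout analysis: take per-word positions (for example from the TSV output, which reports `left` and `width` for each word) and split every line wherever the horizontal gap between neighbouring words is large. The function names, the pixel coordinates in the usage example and the `min_gap` threshold are all assumptions for illustration.

```python
def split_line_at_gaps(words, min_gap):
    """Split one text line into segments wherever the horizontal gap
    between neighbouring words is at least min_gap pixels.

    words: list of (text, left, width) tuples for a single line.
    """
    segments = []
    current = []
    prev_right = None
    for text, left, width in sorted(words, key=lambda w: w[1]):
        if prev_right is not None and left - prev_right >= min_gap:
            segments.append(current)
            current = []
        current.append(text)
        prev_right = left + width
    if current:
        segments.append(current)
    return [" ".join(seg) for seg in segments]


def regroup_columns(lines, min_gap):
    """Collect the first segment of every line, then the second, and so on."""
    columns = []
    for words in lines:
        for i, segment in enumerate(split_line_at_gaps(words, min_gap)):
            while len(columns) <= i:
                columns.append([])
            columns[i].append(segment)
    return columns


# Made-up coordinates that mimic the two-block header described above.
header = [
    [("COUR", 100, 60), ("EUROPEENNE", 170, 140), ("EUROPEAN", 600, 120), ("COURT", 730, 80)],
    [("des", 180, 40), ("of", 690, 30)],
    [("DROITS", 90, 80), ("DE", 180, 30), ("L'HOMME", 220, 100), ("HUMAN", 610, 90), ("RIGHTS", 710, 90)],
]
blocks = regroup_columns(header, min_gap=150)
# blocks[0] holds the left-hand block, blocks[1] the right-hand block
```

A known limitation of this sketch: a line that only has text in the right-hand block would be filed under the first column, so a real version would need to compare against the column positions rather than just counting segments.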

This problem has been haunting me for a while now, and tesseract is such a 
powerful tool and does such a great job with tasks that are far more 
complex that I just cannot accept that it can't get this right.

Thanks in advance for your help,
best,
Peter


PS:
Find below the version I'm using. I do not think this is a problem of the 
version, though; the issue is the same with version 3.
tesseract 4.0.0-beta.3-199-gba757
 leptonica-1.76.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 
4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.3.0
Found SSE

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/ec15c1c1-849a-41d9-b77a-782d5b911496%40googlegroups.com.


[tesseract-ocr] Tesseract 4.0 extracting multiple columns where one is wanted

2018-05-02 Thread peter . bleackley
I am using Tesseract 4.0 to extract text from scanned PDF documents. I 
first use pdftoppm to split the document into pages represented as png 
files, and then use the following command to perform OCR

tesseract page.png stdout -l eng --psm 4

The pages generally have section numbers down the left hand side of the 
page. Sometimes, these are extracted as a column of text, and the actual 
text is extracted as a second column. Since I have set --psm 4, I am 
expecting to get the entire page returned as a single column - and indeed, 
for some pages I do get what I want.

Why is tesseract sometimes extracting the text in columns even when I tell 
it not to, and what can I do about it?

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/0781d032-73b7-415d-97a0-485a1c3210a6%40googlegroups.com.


Re: [tesseract-ocr] Re: Standalone Self-contained Tesseract-OCR for Mac

2017-04-30 Thread Peter Reid
Hi Shree

Sorry for the delay in replying but I'm struggling to get a successful 
build now.  I'm attaching my shell script for you to look at but the 
failure seems to be related to aclocal being called inside autogen.sh.

To be honest I'm not confident that things are building OK elsewhere as I 
see a variety of error and warning messages appear even though the relevant 
script finishes with "success"! I'm not a C/C++ coder; I do all my 
programming using LiveCode.  I'm just trying to get a reliable build of a 
portable version of Tesseract that I can drive through a command-line 
interface.  The OCR capability I'm trying to provide using Tesseract is 
just a part of a much larger app.  LiveCode allows me to build an app on my 
Mac to be deployed on Mac, Windows & Linux.  In each case the only code I 
need for each target deployment is a command line or two that runs 
Tesseract with a given set of source files that my app has 
extracted/created elsewhere. In order to make installation easy I include a 
portable version of Tesseract amongst the resources for my app.

As I'm not a C etc. coder (I last wrote serious C several decades ago!) I'm 
not able to judge which error/warning messages are significant or figure 
out how to fix them.  I was hoping to follow a recipe that would reliably 
build a portable Tesseract for the Mac and Windows.  I'm just trying 
different combinations of sub-builds until I find one that works, which is 
why I ended up with a combination of older versions of the dependencies. So 
I'm not a good person to ask to build this and report errors etc!

Best regards

Peter

On Sunday, April 23, 2017 at 7:31:23 PM UTC+1, shree wrote:
>
> Hi Peter,
> Stefan Weil has made changes to the 3.05 branch to address this issue. 
> Please give a try using the latest commit and preferably provide your 
> feedback in the issue tracker where I have added this.

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/66d17f6b-da1b-45d3-897e-cd566cac570d%40googlegroups.com.

#!/bin/bash

#
# Build Script for making standalone version of Tesseract
# Wes Fowlks
# 10/01/2014
# Originally posted at:https://code.google.com/p/tesseract-ocr/issues/detail?id=1326
# Original pastebin source: http://pastebin.com/VnGLHfbr

# NOTE:
# Peter Reid, 20 April 2017
# you must set 2 environment variables to locate the runnable version of Tesseract:
#   TESSDATA_PREFIX - set to the parent folder for the tessdata folder
#   DYLD_LIBRARY_PATH - set to the path up to and including the lib folder
#
# the runnable version of Tesseract consists of the folder "tesseract" with the following content (comments in brackets):
#
#  bin (folder)
#    tesseract (the executable itself)
#  include (folder)
#    ...
#  lib (folder - pointed to by DYLD_LIBRARY_PATH)
#    ...
#  share (folder - pointed to by TESSDATA_PREFIX)
#    man (folder)
#    tessdata (folder)

BUILD_LIBJPEG=0
BUILD_ZLIB=0
BUILD_LIBPNG=0
BUILD_LEPTONICA=0
BUILD_TESSERACT=1

# Get the base directory of where the script is
BASE_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
BUILD_DIR="$BASE_DIR"/build
ARCHIVE_DIR="$BASE_DIR"/archives
SRC_DIR="$BASE_DIR"/src
TESSERACT_DIR="$BASE_DIR"/tesseract

#Library Versions - unused newer versions (as of 20 April 2017):
# numerous combinations of versions have been tried but it was necessary to backtrack to earlier versions to avoid build errors
#LIBPNG_VERSION=1.6.27 -> 1.6.29

LIBJPEG_VERSION=9b
ZLIB_VERSION=1.2.11
LIBPNG_VERSION=1.6.27
LEPTONICA_VERSION=1.74.1
TESSERACT_VERSION=3.05


bail_out()
{
echo
echo "  Something went wrong, HALTING BUILD!" 
echo
exit 1
}

echo
echo
echo "==> STARTING BUILD..."
echo 

if [ ! -d "$ARCHIVE_DIR" ]; then
	mkdir "$ARCHIVE_DIR"
fi

if [ ! -d "$SRC_DIR" ]; then
	mkdir "$SRC_DIR"
fi

if [ ! -d "$BUILD_DIR" ]; then
	mkdir "$BUILD_DIR"
fi

echo "==> Base Build Directory: "
echo "-> $BUILD_DIR"


# FIRST - Build lib Libjpeg
if [[ "$BUILD_LIBJPEG" = 1 ]]
then
echo
echo "==> Building Lib Jpeg"

# Clean up old files
rm -rf "$SRC_DIR"/jpeg* "$BUILD_DIR"/jpeg*

if [ ! -f "$ARCHIVE_DIR/jpegsrc.v$LIBJPEG_VERSION.tar.gz" ]; then
#Download the file
  

Re: [tesseract-ocr] Re: Standalone Self-contained Tesseract-OCR for Mac

2017-04-20 Thread Peter Reid
Hi ShreeDevi

Thanks for your advice anyway, it got me thinking about the problem in 
other ways. I'm pleased to say that I managed to get a build script that 
works using the following versions of the libs:

LIBJPEG_VERSION=9b
ZLIB_VERSION=1.2.11
LIBPNG_VERSION=1.6.29
LEPTONICA_VERSION=1.71
TESSERACT_VERSION=3.04.01

I couldn't get it to work with any later versions of Leptonica (1.74.1 is 
the latest) or Tesseract (3.05.00 is the latest v3).  Despite this, I now 
have a working standalone Mac version of Tesseract that I can drive from my 
own code!

Apart from the actual files themselves, it's necessary to set 2 environment 
variables:

TESSDATA_PREFIX - set to the parent folder for the tessdata folder
DYLD_LIBRARY_PATH - set to the path up to and including the lib folder
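
As a rough illustration of how a calling app might apply these two variables, here is a sketch that builds the environment and command line for a relocatable bundle. It assumes the folder layout described in the build script's header comment (`bin`, `lib`, `share/tessdata` under one `tesseract` folder); the function names are invented for this example.

```python
import os
from pathlib import Path


def portable_tesseract_env(bundle_dir, base_env=None):
    """Return a copy of the environment with the two variables set for the bundle."""
    bundle = Path(bundle_dir)
    env = dict(os.environ if base_env is None else base_env)
    # Parent folder of the tessdata folder.
    env["TESSDATA_PREFIX"] = str(bundle / "share")
    # Path up to and including the lib folder.
    env["DYLD_LIBRARY_PATH"] = str(bundle / "lib")
    return env


def portable_tesseract_command(bundle_dir, image, output_base, lang="eng"):
    """Build the argument list for the bundled executable."""
    exe = Path(bundle_dir) / "bin" / "tesseract"
    return [str(exe), str(image), str(output_base), "-l", lang]
```

The two results can then be handed to `subprocess.run(cmd, env=env, check=True)`, which keeps the settings local to that one call instead of changing the environment of the whole app.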

I've included my version of the build script in case anyone else needs to 
do something similar.

Thanks again

Peter

On Tuesday, April 18, 2017 at 1:03:22 PM UTC+1, shree wrote:
>
> I haven't built 3.05 so cannot help. I would suggest that you try with 
> older commits of tesseract 3.05 branch to see which one works.
>
> Hope that those who have built 3.05 on mac will help.
>

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/9fb3ad6f-ecd2-47e7-8afb-dd9e34396af0%40googlegroups.com.

#!/bin/bash

#
# Build Script for making standalone version of Tesseract
# Wes Fowlks
# 10/01/2014
# Originally posted at:https://code.google.com/p/tesseract-ocr/issues/detail?id=1326
# Original pastebin source: http://pastebin.com/VnGLHfbr

# NOTE:
# Peter Reid, 20 April 2017
# you must set 2 environment variables to locate the runnable version of Tesseract:
#   TESSDATA_PREFIX - set to the parent folder for the tessdata folder
#   DYLD_LIBRARY_PATH - set to the path up to and including the lib folder
#
# the runnable version of Tesseract consists of the folder "tesseract" with the following content (comments in brackets):
#
#  bin (folder)
#    tesseract (the executable itself)
#  include (folder)
#    ...
#  lib (folder - pointed to by DYLD_LIBRARY_PATH)
#    ...
#  share (folder - pointed to by TESSDATA_PREFIX)
#    man (folder)
#    tessdata (folder)

BUILD_LIBJPEG=1
BUILD_ZLIB=1
BUILD_LIBPNG=1
BUILD_LEPTONICA=1
BUILD_TESSERACT=1

# Get the base directory of where the script is
BASE_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
BUILD_DIR=$BASE_DIR/build
ARCHIVE_DIR=$BASE_DIR/archives
SRC_DIR=$BASE_DIR/src
TESSERACT_DIR=$BASE_DIR/tesseract

#Library Versions - unused newer versions (as of 20 April 2017):
# numerous combinations of versions have been tried but it was necessary to backtrack to earlier versions to avoid build errors
#LEPTONICA_VERSION=1.71 -> 1.74.1
#TESSERACT_VERSION=3.04.01 -> 3.05.00

LIBJPEG_VERSION=9b
ZLIB_VERSION=1.2.11
LIBPNG_VERSION=1.6.29
LEPTONICA_VERSION=1.71
TESSERACT_VERSION=3.04.01

echo
echo
echo "==> STARTING BUILD..."
echo 


echo "==> Base Build Directory: " $BUILD_DIR

# Functions useful throughout the script
function setupDirs() {
    # mkdir -p creates each directory only if it does not already exist
    mkdir -p "$ARCHIVE_DIR" "$SRC_DIR" "$BUILD_DIR"
}

# FIRST - Build lib Libjpeg
if [[ $BUILD_LIBJPEG = 1 ]]
then
echo
echo "==> Building Lib Jpeg"
setupDirs

# Clean up old files
rm -rf "$SRC_DIR"/jpeg* "$BUILD_DIR"/jpeg*

# Download the archive only if it is not already cached; the cached name
# must match the name checked here, otherwise it is fetched on every run
if [ ! -f "$ARCHIVE_DIR/jpegsrc.v$LIBJPEG_VERSION.tar.gz" ]; then
    curl -o "$ARCHIVE_DIR/jpegsrc.v$LIBJPEG_VERSION.tar.gz" "http://www.ijg.org/files/jpegsrc.v$LIBJPEG_VERSION.tar.gz"
fi

echo "==> Extracting archive"
tar -xzf "$ARCHIVE_DIR/jpegsrc.v$LIBJPEG_VERSION.tar.gz" -C "$SRC_DIR"

cd "$SRC_DIR/jpeg-$LIBJPEG_VERSION"

echo "==> Configuring Lib Jpeg for Standalone"
./configure --disable-shared --prefix="$BUILD_DIR"

echo "==> Building LIBJPEG and deploying to $BUILD_DIR"
make install

#Check if the build was successful
if [ -f "$BUILD_DIR/include/jpeglib.h" ]; then 
echo "==> LIB JPEG Build Successful"
else
echo "

Re: [tesseract-ocr] Re: Standalone Self-contained Tesseract-OCR for Mac

2017-04-18 Thread Peter Reid
I'm trying to build Tesseract 3, namely version 3.05.00 or thereabouts.

In fact I started trying to build using the latest versions of all the libs 
but had several failures, so I've backtracked to earlier versions in order 
to get successful builds.  The latest release versions are the following I 
think:

ZLIB_VERSION=1.2.11
LIBJPEG_VERSION=9b
LIBPNG_VERSION=1.6.29
LEPTONICA_VERSION=1.74.1
TESSERACT_VERSION=3.05.00

My latest attempt built zlib and libjpeg successfully but failed at 
libpng:

Undefined symbols for architecture x86_64:
  "_inflateValidate", referenced from:
  _png_inflate_claim in pngrutil.o
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see 
invocation)
make[1]: *** [libpng16.la] Error 1
make: *** [check] Error 2

I've looked at the link about compiling Tesseract, but for Mac deployment it 
describes only MacPorts and Homebrew, which do not generate standalone 
tesseract binaries.  My apps need to include a runnable version of 
tesseract that doesn't require any installation.  I can do this for Windows 
as the compiling web page gives the details, but I'm having to try to build 
the standalone version for the Mac myself.  This is why I'm going through 
this process!

Thanks again

Peter

On Tuesday, April 18, 2017 at 10:15:50 AM UTC+1, shree wrote:
>
> Please see https://github.com/tesseract-ocr/tesseract/wiki/Compiling
>
>
> If you are building tesseract 4.0, you need Lept 1.74
>
> ShreeDevi
> 
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Tue, Apr 18, 2017 at 2:25 PM, Peter Reid <peter@gmail.com 
> > wrote:
>
>> Hi ShreeDevi
>>
>> I have tried the latest version of Leptonica but I get numerous warnings 
>> (38 of them, mainly about implicit function definitions) and a fatal error 
>> 'endian.h' not found.  The build finishes saying that Leptonica has been 
>> built OK and its library appears in the lib folder.  However, when I try to 
>> build Tesseract, I get the following error:
>>
>> checking for leptonica... yes
>> checking for pixCreate in -llept... no
>> configure: error: leptonica library missing
>> Configuration done, now Building
>> make: Nothing to be done for `install'.
>> Tesseract build failed. Exiting.
>>
>> So I'm not better off with the latest version.  At least with version 
>> 1.73 I don't get the warnings and error messages when building Leptonica 
>> even though the Tesseract build fails.
>>
>> Thanks
>>
>> Peter
>>
>>
>> On Thursday, March 24, 2016 at 10:49:03 AM UTC, Peter Reid wrote:
>>>
>>> I have a standalone version of tesseract-ocr for Windows that can be run 
>>> from a folder located anywhere in the Windows filing system without having 
>>> to do an installation.  For the Mac the user has to install 
>>> Homebrew/MacPorts first and then tesseract-ocr afterwards.  This fixes 
>>> tesseract-ocr to particular parts of the OS X filing system, preventing it 
>>> from being relocated and used elsewhere on the Mac.  
>>>
>>> I'm looking for a standalone/self-contained version of tesseract-ocr for 
>>> the Mac that can be located anywhere and can be run without requiring 
>>> installations.  Please can someone point to such a version of tesseract-ocr 
>>> or give instructions on how I can build one of these!
>>>
>>> Thanks
>>>

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/909583a3-23e2-4543-a9a6-30c4dd8c6037%40googlegroups.com.


[tesseract-ocr] Re: Standalone Self-contained Tesseract-OCR for Mac

2017-04-18 Thread Peter Reid
Hi ShreeDevi

I have tried the latest version of Leptonica but I get numerous warnings 
(38 of them, mainly about implicit function definitions) and a fatal error 
'endian.h' not found.  The build finishes saying that Leptonica has been 
built OK and its library appears in the lib folder.  However, when I try to 
build Tesseract, I get the following error:

checking for leptonica... yes
checking for pixCreate in -llept... no
configure: error: leptonica library missing
Configuration done, now Building
make: Nothing to be done for `install'.
Tesseract build failed. Exiting.

So I'm not better off with the latest version.  At least with version 1.73 
I don't get the warnings and error messages when building Leptonica even 
though the Tesseract build fails.

Thanks

Peter


On Thursday, March 24, 2016 at 10:49:03 AM UTC, Peter Reid wrote:
>
> I have a standalone version of tesseract-ocr for Windows that can be run 
> from a folder located anywhere in the Windows filing system without having 
> to do an installation.  For the Mac the user has to install 
> Homebrew/MacPorts first and then tesseract-ocr afterwards.  This fixes 
> tesseract-ocr to particular parts of the OS X filing system, preventing it 
> from being relocated and used elsewhere on the Mac.  
>
> I'm looking for a standalone/self-contained version of tesseract-ocr for 
> the Mac that can be located anywhere and can be run without requiring 
> installations.  Please can someone point to such a version of tesseract-ocr 
> or give instructions on how I can build one of these!
>
> Thanks
>

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/a0bdea5e-9e44-4a0e-b343-e0322fffe9c3%40googlegroups.com.


Re: [tesseract-ocr] Re: Tesseract v3.03 and norwegian language

2017-01-06 Thread Peter


On Thursday, 5 January 2017 at 04:39:01 UTC+1, shree wrote:
>
> Ray is planning to retrain the languages for the new 4.0.0 version 
> sometime in January. So it would be helpful if you could open an issue on 
> https://github.com/tesseract-ocr/langdata/issues with this information.
>

Is it possible to contribute training data for this effort? I realise 
Swedish will not be at the top of the list, but I think it would be easy to 
involve some of the research community here in contributing training data 
if it could improve the language model.

/Peter 

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/9788db26-bb8a-4861-b29e-80db2b5a687f%40googlegroups.com.


[tesseract-ocr] Improve recognition in pattern with single chars?

2017-01-04 Thread Peter
Hi!

We are using Tesseract (4.0 alpha) on scanned library catalog cards where 
some of the text is typically written with spacing between the characters, as 
in [1]: 

"A d e l s k ö l d, Johan Christian"

This makes it difficult to use word recognition in Tesseract, and there are 
typically errors in the initial letter sequence (e.g. "A d e 1 s k ö l 
d..."). Is there a way to improve this type of recognition? User patterns 
seem to require an initial sequence of 4 letters, which we do not have. 
Are there other options to improve recognition?

Regards,

Peter

[1]: 
https://data.kb.se/datasets/2016/09/hs_nominalkatalog/01_A-Am/Nominal_20151207_071211_000111.jpg/

[2]: Full dataset here 
https://data.kb.se/datasets/2016/09/hs_nominalkatalog/
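
For the letter-spaced headings described above, one possible workaround is to clean up the OCR output after recognition rather than changing the recognition itself. The following is only a minimal sketch under that assumption: the function names, the "three or more single characters" threshold and the "1 between letters is really l" rule are all illustrative choices, not Tesseract features.

```python
import re

# Three or more single characters separated by single spaces, e.g. "A d e l".
# The threshold is an assumption; tune it for the cards at hand.
SPACED_RUN = re.compile(r"\b(?:\w ){2,}\w\b")


def collapse_letterspacing(text: str) -> str:
    """Join runs of space-separated single characters into one word."""
    return SPACED_RUN.sub(lambda m: m.group(0).replace(" ", ""), text)


def fix_digit_one(text: str) -> str:
    """Replace a '1' that sits between two letters with 'l' (heuristic)."""
    return re.sub(r"(?<=[^\W\d_])1(?=[^\W\d_])", "l", text)


line = "A d e 1 s k ö l d, Johan Christian"
print(fix_digit_one(collapse_letterspacing(line)))
# prints: Adelsköld, Johan Christian
```

The digit rule would also rewrite genuine letter-digit-letter codes, so it is probably best applied only to the name field of a card.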

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/f3785a98-29d1-41ef-89a8-4a970d3b1d84%40googlegroups.com.


[tesseract-ocr] Re: ocr on real (dirty) printing

2015-04-16 Thread Peter Joh. Brunner
Once again, with more information:

I have a problem using Tesseract with German Fraktur.

I work with Tesseract 3.02.02 on SUSE Linux 13.2.

First, the text to be OCR'd is real printed text from about 1930.
The printing is a little dirty, i.e. there are small dots and strokes 
between the letters.
Though these are far smaller than the real letters, they are interpreted 
as normal letters.

[attached image: oes-frak.frak.exp017]

Is it possible to give Tesseract parameters so that it 
. either ignores letters which do not fit the majority of the other 
  letters, 
. or only uses letters in a given size range,
. or first makes the boxes, 
  then lets me correct the boxes, by hand or by program,
  and finally recognises the text using the corrected boxes?

I have already tried modifying these settings in a config file:
  textord_min_xheight 24
  textord_xheight_mode_fraction 0.9
  textord_xheight_error_margin 0.1
  textord_descx_ratio_min 0.3
  tessedit_redo_xheight FALSE
This changes some things, but it does not make Tesseract ignore the dots and strokes.

Here is an example: 
the attached picture is recognised as the text
  15 Ellser Exdmsund Mögsgzerg

A dictionary-based solution is not possible, because the text consists 
only of names of persons and locations.

Another thing I wonder about:
when I OCR an image from .tiff to .txt
and run makebox on the same image,
a few letters are recognised differently!

thanks for help in advance
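
The "correct the boxes by program" option from the list above could be sketched roughly as follows. This is only an illustration under stated assumptions: it expects box file lines in the usual "symbol left bottom right top [page]" layout, the function names and the 0.4 threshold are made up for the example, and the sample boxes are invented. Whether the cleaned boxes can then be used for recognition is not something this thread answers.

```python
from statistics import median


def parse_box_line(line):
    """Split 'symbol left bottom right top [page]' into (symbol, numbers)."""
    parts = line.split()
    return parts[0], [int(p) for p in parts[1:]]


def drop_small_boxes(lines, min_fraction=0.4):
    """Keep only boxes whose height is at least min_fraction of the median height."""
    rows = [line for line in lines if line.strip()]
    heights = []
    for row in rows:
        _, nums = parse_box_line(row)
        left, bottom, right, top = nums[:4]
        heights.append(top - bottom)
    limit = min_fraction * median(heights)
    return [row for row, height in zip(rows, heights) if height >= limit]


# Invented sample: three letters and one tiny speck between them.
sample = ["E 10 10 30 40 0", "l 35 10 45 42 0", "' 47 30 49 33 0", "s 52 10 70 30 0"]
print(drop_small_boxes(sample))
```

A pure height filter also throws away legitimately small glyphs such as full stops and commas, so those would need a separate rule (for example, keeping small boxes that sit on the baseline).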

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/0c58a26a-a8be-4550-9fca-593669a8cf5c%40googlegroups.com.


[tesseract-ocr] ocr on real (dirty) printing

2015-04-02 Thread Peter Joh. Brunner
I have a problem using Tesseract with German Fraktur.

First, the text to be OCR'd is real printed text from about 1930.
The printing is a little dirty, i.e. there are small dots and strokes 
between the letters.
Though these are far smaller than the real letters, they are interpreted 
as normal letters.

Is it possible to give Tesseract parameters so that it 
. either ignores letters which do not fit the majority of the other 
  letters, 
. or only uses letters in a given size range,
. or first makes the boxes, 
  then lets me correct the boxes, by hand or by program,
  and finally recognises the text using the corrected boxes?

A dictionary-based solution is not possible, because the text consists 
only of names of persons and locations.

Another thing I wonder about:
when I OCR an image from .tiff to .txt
and run makebox on the same image,
a few letters are recognised differently!

thanks for help in advance

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/7a3189e9-7bf4-408b-906d-c85090c7fc8f%40googlegroups.com.


[tesseract-ocr] Building training tools from source

2014-08-01 Thread Peter Hamberg
Hi,
I'm trying to learn how to use tesseract, but I think I need some help. 
I'm currently stuck on the training part, because for some reason 
I can't build the training tools.

I'm on an Ubuntu 14.04 machine, and I've followed the instructions 
on https://code.google.com/p/tesseract-ocr/wiki/Compiling - added the 
libraries, leptonica, 
wget https://tesseract-ocr.googlecode.com/files/tesseract-ocr-3.02.02.tar.gz
config, make, make install; all of that seems to work without any error 
messages.

But when I try

make training

all I get is the message make: Nothing to be done for `training'.
Am I in the wrong folder? What have I missed here?

// Peter

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/ac1705f0-ed9e-408e-82a0-b835e4e8ad51%40googlegroups.com.


Re: [tesseract-ocr] Building training tools from source

2014-08-01 Thread Peter Hamberg
I knew it had to be something obvious. I didn't realise that the version of 
the source code available from the download page wasn't the latest version. 
Thanks for the clarification. Switching to the newer version worked.

On Friday, August 1, 2014 1:00:33 PM UTC+2, zdenop wrote:
>
> On Fri, Aug 1, 2014 at 11:06 AM, Peter Hamberg zazca...@gmail.com wrote:
>
>> Hi,
>>
>> I'm on a Ubuntu 14.04 machine, and I've followed the instructions on 
>> https://code.google.com/p/tesseract-ocr/wiki/Compiling 
>
> That page says: "If you want the training tools (3.03)..."
> This means that those instructions are valid for version 3.03, but you are 
> compiling version 3.02.02 (where the training tools are built automatically).
>
>> - added the libraries, leptonica, 
>> wget 
>> https://tesseract-ocr.googlecode.com/files/tesseract-ocr-3.02.02.tar.gz
>> config, make, make install, all that seems to work without any error 
>> messages.
>>
>> But when I try
>>
>> make training
>>
>> all i get is the message make: Nothing to be done for `training'.
>> Am I in the wrong folder? What have I missed here?
>>
>> // Peter


To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/a3af68d5-9a5e-4296-8803-8292e75efeb4%40googlegroups.com.


Re: Training 7-segment display digits

2013-01-28 Thread David Peter Lisin Crespo
Good morning Seema, 

   As for training Tesseract for seven-segment OCR, I also asked in the 
forum but did not get a reply. In the end I simply used OpenCV.

The steps were:

- Convert the image to black and white.
- Clean the image (erosion, dilation, etc.).
- Detect contours (this works very well).
- For each contour, draw three lines: one that cuts the contour vertically 
in half, and two horizontal lines at 1/4 and 3/4 of the contour height.
- Mark the points where the segments cross these lines. This gives 7 points: 
the vertical line cuts three segments, and each horizontal line cuts two 
(in the upper and lower halves respectively).
- Check the pixel value at each point. If it is black, consider the segment active.
- From the resulting active segments, you get which digit is shown.
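
The line-crossing and lookup steps above can be sketched without OpenCV once a digit has been cropped to its contour. This is a minimal illustration, not the author's actual code: it assumes the crop is a list of rows in which truthy cells are segment pixels (black pixels in the binarised image described above), it uses the common a-g segment naming, and the function names are hypothetical. Instead of testing single points it checks whole stretches of the three lines, which is a slightly more forgiving variant.

```python
# Segment names (common convention):
#   a = top, b = upper right, c = lower right, d = bottom,
#   e = lower left, f = upper left, g = middle.
DIGITS = {
    "abcdef": 0, "bc": 1, "abdeg": 2, "abcdg": 3, "bcfg": 4,
    "acdfg": 5, "acdefg": 6, "abc": 7, "abcdefg": 8, "abcdfg": 9,
}


def lit_segments(grid):
    """Return the active segments of one cropped digit as a sorted string."""
    height, width = len(grid), len(grid[0])
    third, half = height // 3, width // 2

    # Vertical line through the middle: crosses top, middle and bottom segments.
    column = [row[width // 2] for row in grid]
    upper_row = grid[height // 4]        # crosses upper-left and upper-right
    lower_row = grid[(3 * height) // 4]  # crosses lower-left and lower-right

    state = {
        "a": any(column[:third]),
        "g": any(column[third:height - third]),
        "d": any(column[height - third:]),
        "f": any(upper_row[:half]),
        "b": any(upper_row[width - half:]),
        "e": any(lower_row[:half]),
        "c": any(lower_row[width - half:]),
    }
    return "".join(sorted(name for name, lit in state.items() if lit))


def read_digit(grid):
    """Map the active segments to a digit, or None for an unknown pattern."""
    return DIGITS.get(lit_segments(grid))
```

Note that a "1" cropped tightly to its contour is only as wide as one segment, so this sketch would misread it; it would need special handling, for example by looking at the aspect ratio of the contour.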

Hope this helps ;)

PS: This page uses a similar algorithm: 
http://www.unix-ag.uni-kl.de/~auerswal/ssocr/
Raj found out how to do it with Tesseract, but was not able to answer:
https://groups.google.com/forum/?fromgroups=#!topic/tesseract-ocr/elnIngFJvQs
A good paper describing a process similar to what I had to do: 
http://morgoth.zemris.fer.hr/people/Marko.Cupic/files/2009-SP-MIPRO.pdf

Hope it works out for you ;)

Kind regards, 

   David Lisin

On Sunday, 27 January 2013 at 23:43:04 UTC+1, Seema Shettar wrote:
>
> On Thursday, March 5, 2009 3:03:03 AM UTC-4, Raj wrote:
>>
>> Hi all,
>>
>> I'm a newbie and want to use Tesseract OCR to read a 7-segment 
>> display. 
>>
>> For this I'm using C#.NET 2005, an open-source image processing library 
>> (OpenCV) and the C# wrapper Emgu CV. 
>>
>> I remove the noise from the image before passing it to the Tesseract 
>> OCR engine, 
>>
>> but I'm getting mixed results: for example, digit '0' is detected as 11 
>> and digit '6' as 5. 
>>
>> I read about training Tesseract. Is it possible to train it for 
>> the 7-segment display? 
>>
>> If yes, please tell me how to train Tesseract. 
>>
>> Thank you. 






Re: Training Tesseract for single digit

2013-01-08 Thread David Peter Lisin Crespo
Good afternoon Sunitha,

Does it also fail when you try the single digit with the -psm 10 digits options?

2013/1/8 sunitha raghurajan sunitha.raghura...@gmail.com

 Tesseract
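
For reference, the "-psm 10 digits" suggestion above corresponds to running tesseract in single-character page segmentation mode together with the stock "digits" config file. A minimal sketch of calling it from Python follows; the function names and file names are hypothetical, and it assumes a tesseract binary on the PATH with the "digits" config installed.

```python
import subprocess


def single_digit_command(image_path, output_base, psm_flag="-psm"):
    """Build the command line for recognising one isolated digit.

    psm_flag is "-psm" as written in this thread; newer releases spell it "--psm".
    """
    return ["tesseract", image_path, output_base, psm_flag, "10", "digits"]


def read_single_digit(image_path, output_base="digit"):
    """Run tesseract and return the text it wrote to <output_base>.txt."""
    subprocess.run(single_digit_command(image_path, output_base), check=True)
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read().strip()
```

Calling read_single_digit("digit.tif") would then return whatever tesseract recognised for that one image.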



Re: How to improve an existing language?

2011-05-19 Thread Peter Alberti
I forgot to add:
If you are willing to release your files under the Apache License,
version 2.0, then you are more than welcome to send them to me and I
will try to build a .traineddata file that's ready for
testing, using both your files and mine. With a bit of luck, it will
probably be very easy for me.

Best regards,
Peter



Re: How to improve an existing language?

2011-05-16 Thread Peter Alberti
Hi Holger

If there are plenty of s's and t's on the page, it is no problem to
skip a couple. Another possible strategy is to create one box for both
letters ('st'), similar to what you've probably done for the 'ch' ligature.

From a quick look at your image, I suspect you might get problems
with overlapping boxes when you try to use it for training ('lt' in
"enthalten" looks problematic). But try running tesseract with the box.train
command and see what happens.
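
Before running the training step, overlapping boxes of the kind mentioned above can be listed with a few lines of Python. This is only a sketch: it assumes box file lines of the form "symbol left bottom right top [page]", the function names are hypothetical, and it simply compares every pair of boxes on the same page.

```python
def parse_box(line):
    """Parse one box file line into (symbol, left, bottom, right, top, page)."""
    parts = line.split()
    left, bottom, right, top = (int(p) for p in parts[1:5])
    page = int(parts[5]) if len(parts) > 5 else 0
    return parts[0], left, bottom, right, top, page


def overlapping_pairs(lines):
    """Return (symbol, symbol) for every pair of intersecting boxes on one page."""
    boxes = [parse_box(line) for line in lines if line.strip()]
    pairs = []
    for i, (sym_a, l1, b1, r1, t1, p1) in enumerate(boxes):
        for sym_b, l2, b2, r2, t2, p2 in boxes[i + 1:]:
            same_page = p1 == p2
            if same_page and l1 < r2 and l2 < r1 and b1 < t2 and b2 < t1:
                pairs.append((sym_a, sym_b))
    return pairs


# Invented sample: 'l' and 't' share two pixel columns, 'e' stands apart.
print(overlapping_pairs(["l 10 10 22 40 0", "t 20 10 30 38 0", "e 35 10 50 30 0"]))
```

Pairs reported here are candidates for either merging into a single box (as suggested for 'st') or skipping.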

I've always used as many sets of .tif/.box files as I had available.
For the version of deu-frak available in the downloads section, I think it
was about 25. For the newest version from
https://github.com/paalberti/tesseract-dan-fraktur/tree/master/deu-frak
it is 32, I think.

Best regards,
Peter

On 15 May, 23:56, stinguin holgerkoe...@googlemail.com wrote:
> Hi all,
>
> I was diligent and built a new wordlist and some new box files. Can
> you take a look at my boxes before I use them to create a new
> traineddata? Because there are different fonts and because some
> letters are too close together to separate (e.g. 's' next to 't'), I
> couldn't make a box for each letter, as you can see here:
>
> http://s1.directupload.net/file/d/2525/kheno9zf_jpg.htm
>
> Is it bad that I ignore some characters of the original page, or is
> it OK? Would it be better to use a bitonal scan? And which is better,
> slim boxes or boxes with some space around the letters?
>
> Many thanks in advance (as usual ;-)
>
> Holger
>
> @ Peter: Can you tell me how many box files the official deu-frak
> language consists of? Only the 8 deu-frak ones?




Re: How to improve an existing language?

2011-04-13 Thread Peter Alberti
Hi

deu-frak.traineddata is a file I created, so I'm happy to hear that
someone might want to improve it.
Actually, I've continued to work on it a little myself, and you
can get the files I'm using from

https://github.com/paalberti/tesseract-dan-fraktur

The files you find there ought to be a little bit better than the
deu-frak.traineddata available under downloads, but I haven't done any
proper testing yet, so your mileage may vary. Also, the tif/box files in the
dan-frak/ subdirectory might work slightly better than those under
deu-frak/ (Danish is the language I'm most interested in), so if you want
to retrain yourself, you might want to work with those.

The two most obvious improvements I can think of are to add some
tif/box files that look more like the texts you're OCR-ing, if possible, and
maybe to build a better wordlist (if I remember correctly, the German
one was a bit of a quick hack).

Best regards, Peter.

On 12 Apr., 22:09, stinguin holgerkoe...@googlemail.com wrote:
> Hi list,
>
> I'm new to tesseract and hope that one of you can help me. I want
> to OCR some German texts which are typeset in Fraktur. The results
> from the existing deu-frak language are good, but not good
> enough. Is it possible to improve this language by training? If so,
> can someone explain that step by step?
> I only know how to create a new language. Do you think I can improve
> the results by creating my own? I think the deu-frak language is
> more than just a few box files, isn't it?
>
> Thanks in advance




Re: Danish fraktur support in r319

2010-05-24 Thread Peter Alberti
On May 24, 1:46 pm, Lars Aronsson l...@aronsson.se wrote:
> Peter Alberti wrote:
>> I've trained tesseract r319 (3.0) to support Danish texts written in 
>> fraktur. It is not
>> perfect but good enough that I hope it may be useful to others.
>
> This is great! The file dan-frak.traineddata is a binary file.
> Tesseract is an open source software. Is there some
> documentation for this file format, so I can read and
> understand what's in there? I want to keep the part
> that is about fraktur/blackletter and substitute the
> part that is about Danish pre 1870 spelling for
> something based on my Swedish dictionaries.

If Jimmy's method for extracting components doesn't work for you, let
me know and I'll try and put my input files together and post them.

I just think I have to add that making a Swedish version of it will
probably be a bit more cumbersome than it might seem at first as I
haven't included enough fraktur letters for your purposes (there's no
ä, ö or å).

Best regards,
Peter




Problems with digits

2009-10-06 Thread Hans Peter Bremer
Hello Tesseract specialists,

I would like to use Tesseract exclusively to recognize strings like the
following samples on faxes.

*AXL* HV09AM1*123456
*AXL* HV09AM2*456789
*AXL* AMO09HM1*786543

For this I have prepared several training files. I attach one of them here.
The others are similar, but in bold and italic.

After the training I still have character strings like those typed above
in freq-dawg.
Tesseract recognizes the letters (*AXL* HV09AM1* etc.) very
well.

But it has great problems with the digits.

With the original deu language files the digits work fine, but then Tesseract
no longer recognizes the other characters.

How can I train the digits better? What am I doing wrong? Do I generally have to
train digits and other characters separately?

How can I exclude special characters and lower-case letters? I have not found
any section in the FAQ about this.
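
Tesseract has a tessedit_char_whitelist variable for restricting which characters may be recognised; how to set it for this particular setup is not shown in the thread. Independently of that, unwanted characters can be stripped from the OCR output afterwards. The sketch below takes that second route. The allowed set is an assumption derived only from the three sample strings above, and the function name is hypothetical.

```python
import string

# '*', space, upper-case letters and digits: everything that occurs in the
# sample strings. This set is an assumption based on those samples only.
ALLOWED = set("* " + string.ascii_uppercase + string.digits)


def keep_allowed(text, allowed=ALLOWED):
    """Drop every character that is not in the allowed set (newlines are kept)."""
    return "".join(ch for ch in text if ch in allowed or ch == "\n")


print(keep_allowed("*AXL* HV09AM1*123456~"))
# prints: *AXL* HV09AM1*123456
```

Filtering only removes stray characters; it cannot repair a digit that was recognised as the wrong digit.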

I hope somebody can help me.

Greetings HaPe


attachment: timesnr_14n.tif

Re: special sign exclude

2009-10-05 Thread Hans Peter Bremer
Hello Ray,

In which section of the FAQ?

I cannot find it.

Greetings HaPe

2009/10/3 Ray Smith theraysm...@gmail.com

> See the FAQ page.
> Ray.
>
> On Oct 2, 2009 12:03 AM, HaPe hapebre...@googlemail.com wrote:
>
>> Hello,
>>
>> is it possible to exclude particular characters (special characters) from
>> recognition?
>>
>> Greetings HaPe






Training with fonts

2009-09-29 Thread Hans Peter Bremer
Hello,

The documents I want to recognize are all set in the same font (Times
New Roman 14).

Can I specify this font during training?

Would that give better results?

Greeting HaPe




Problem with boxfile

2009-07-28 Thread Hans Peter Bremer
Hi,

I've got a problem with creating a box file, or rather with the result.

##
Here's the tif :

http://img265.imageshack.us/i/981501crop1.tif/

###

And here the result :

* 33 16 49 26
A 58 11 87 26
B 93 11 119 26
C 127 10 153 26
D 160 11 189 26
E 195 11 220 26
F 227 11 248 26
~ 256 7 939 26
1 951 11 962 25
2 973 11 993 25
3 1000 10 1018 25
4 1027 11 1046 25
5 1054 10 1071 26
6 1081 10 1100 26
7 1107 10 1126 25
8 1135 10 1152 26
9 1160 10 1179 26
0 1187 10 1206 26
* 1215 16 1231 26
#

The letters H-Z are ignored.

I don't know what went wrong?
I hope somebody can help me.
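
In the listing above, the box labelled "~" runs from x = 256 to x = 939, far wider than any other box, which suggests that the missing letters were merged into that single box. That reading is an interpretation of the numbers, not something confirmed in the thread. A small sketch for spotting such boxes automatically follows; the function name and the factor of 5 are arbitrary choices for the example.

```python
from statistics import median


def wide_boxes(lines, factor=5.0):
    """Return (symbol, width) for boxes wider than factor times the median width."""
    boxes = []
    for line in lines:
        symbol, left, _bottom, right, _top = line.split()[:5]
        boxes.append((symbol, int(right) - int(left)))
    limit = factor * median(width for _, width in boxes)
    return [(symbol, width) for symbol, width in boxes if width > limit]


# Excerpt from the box listing in this message.
excerpt = [
    "E 195 11 220 26",
    "F 227 11 248 26",
    "~ 256 7 939 26",
    "1 951 11 962 25",
    "2 973 11 993 25",
]
print(wide_boxes(excerpt))
# prints: [('~', 683)]
```

A box flagged this way would have to be split by hand into one box per letter before the file is used for training.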

Greetz

HaPe




Re: Debian-AMD64: Java Error

2008-11-02 Thread Peter Davoust

> config.status: error: cannot find input file: java/Makefile.in
> I suspect this is because there is no java for the 64bit
> architecture.

I don't think that's the problem, because I was able to configure
tesseract on amd64. Have you checked whether that file actually
exists? I think if it were a Java problem it would have said
something about Java itself; this sounds like something else.

-- 
Peter Davoust ([EMAIL PROTECTED])
worldgnat.blogspot.com

Don't mistake your watermelon for the universe
-Kenn Amdahl

Nature is designed to work - let it.




Re: Debian-AMD64: Java Error

2008-11-02 Thread Peter Davoust

I got the same output... I think the image has to have quite a bit of
space between the characters for it to work. I haven't had much time
to mess with it though.

On Sun, Nov 2, 2008 at 10:45 PM, charlesrkiss [EMAIL PROTECTED] wrote:
>
> I got tesseract-ocr.tar.gz to unpack, configure, make, sudo make
> install  and... print out the phototest.tif to a text file...
>
> but my scanned image.tif/tiff creates an empty text file!  BUMMER!!
> Almost thought I had it working.  Does it now need some training???
>
> I tried to get ocropus to install, but I don't need the html output.
>
> Thanks!!




-- 
Peter Davoust ([EMAIL PROTECTED])
worldgnat.blogspot.com

Don't mistake your watermelon for the universe
-Kenn Amdahl

Nature is designed to work - let it.




Re: .txt file is EMPTY

2008-11-02 Thread Peter Davoust

The issue is not that test.txt isn't being created, the issue is that
test.txt is empty when it is created.

On Mon, Nov 3, 2008 at 12:31 AM, 74yrs old [EMAIL PROTECTED] wrote:
> Have you run the command line as follows?
> tesseract.exe phototest.tif test
> The output will be test.txt.
>
> On Mon, Nov 3, 2008 at 9:17 AM, charlesrkiss [EMAIL PROTECTED] wrote:
>>
>> I got tesseract-ocr.tar.gz to unpack, configure, make, sudo make
>> install  and... print out the phototest.tif to a text file...
>>
>> but my scanned image.tif/tiff creates an empty text file!  BUMMER!!
>> Almost thought I had it working.  Does it now need some training???
>>
>> I tried to get ocropus to install, but I don't need the html output.
>>
>> Any ideas??
>>
>> Thanks!!





-- 
Peter Davoust ([EMAIL PROTECTED])
worldgnat.blogspot.com

Don't mistake your watermelon for the universe
-Kenn Amdahl

Nature is designed to work - let it.
