from:"Zdenko Podobny"

Re: [tesseract-ocr] Re: Tesseract training ground truth: I'm confused about the box files

2024-09-07 Thread Zdenko Podobny

tesstrain is a tested method to train/improve a tesseract language model. It
creates box files for you.
You can try your own ways, but then your problems are your own, and you should
not expect somebody to adjust the code to your needs.
Of course, you are welcome to contribute your solution.
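
As a rough illustration of what an auto-generated box file can look like: later in this thread, the generate_line_box.py script used by the tesstrain Makefile is described as assigning the same full image area to each character. The sketch below is a simplified stand-in for that idea, not the actual script. The name make_line_box is hypothetical, the tab end-of-line entry follows the box-file description in tessdoc, and the coordinates given to that tab entry are an assumption.

```python
def make_line_box(text: str, width: int, height: int, page: int = 0) -> str:
    """Build box-file content for one line image of size width x height.

    Every character receives the same box: the full image area.
    Box rows use the layout: <symbol> <left> <bottom> <right> <top> <page>.
    """
    rows = [f"{ch} 0 0 {width} {height} {page}" for ch in text]
    # A tab entry marks the end of the text line (its coordinates here are an assumption).
    rows.append(f"\t 0 0 {width} {height} {page}")
    return "\n".join(rows) + "\n"


if __name__ == "__main__":
    # Hypothetical ground truth "Hi" for a 120x32 pixel line image.
    print(make_line_box("Hi", 120, 32), end="")
```

The real script also deals with details such as combining characters, which this sketch ignores.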

Zdenko


On Sat, 7 Sep 2024 at 3:55, 'Danny' via tesseract-ocr <
tesseract-ocr@googlegroups.com> wrote:

> I think this google group is having technical troubles.  I got an email
> about a new post from Menelik Berhan but his message doesn't appear on the
> web.  He said:
>
>
>
> | This might be helpful:
> | https://tesseract-ocr.github.io/tessdoc/tess4/Make-Box-Files.html
> | And also some details in:
> | https://tesseract-ocr.github.io/tessdoc/tess4/TrainingTesseract-4.00.html#making-box-files
>
> Same as what Tom said. Very helpful!
>
> To summarize:
> - Box files always contain one line per character
> - There are two kinds of box files: *per-character* and *per-line* box
> files
> - per-character box files have separate coordinates for each character
> - per-line box files still have one line per character, but the
> coordinates are always the same and represent the bounding box of the
> entire text
>
> The training code, specifically *Tesseract::TrainFromBoxes()*, *should* accept
> either format.
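
The difference between the two kinds of box file summarized above can be sketched as a small conversion: take a per-character box file and give every character the common bounding rectangle of all characters (the same conversion Mateusz describes in his experiment). This is a minimal sketch, assuming the file describes a single text line on a single page and that each row has the layout `<symbol> <left> <bottom> <right> <top> <page>`; the function name to_line_boxes is hypothetical.

```python
def to_line_boxes(box_text: str) -> str:
    """Assign the union bounding box of all rows to every row."""
    # rsplit from the right so a symbol that is itself a space or tab survives.
    rows = [r.rsplit(" ", 5) for r in box_text.splitlines() if r.strip()]
    if not rows:
        return ""
    left = min(int(r[1]) for r in rows)
    bottom = min(int(r[2]) for r in rows)
    right = max(int(r[3]) for r in rows)
    top = max(int(r[4]) for r in rows)
    return "".join(f"{r[0]} {left} {bottom} {right} {top} {r[5]}\n" for r in rows)


if __name__ == "__main__":
    per_char = "H 10 5 20 30 0\ni 22 5 26 28 0\n"
    print(to_line_boxes(per_char), end="")
```

Both rows in the example end up with the box `10 5 26 30`, which is the per-line form: still one row per character, but identical coordinates.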
>
> As mentioned in this and other posts, the box identification for Chinese
> seems to be quite broken. Like this:
> [image: Screenshot 2024-08-05 at 17.56.12.png]
>
> That might or might not be a training issue, but I will try retraining the
> model using *per-line* box files and see if that makes any difference.
>
> Thanks to all.
>
> On Friday, September 6, 2024 at 11:18:44 PM UTC+8 tfmo...@gmail.com wrote:
>
>> That's weird. I posted an answer to this thread yesterday and now, in
>> its place, Google Groups says "Message has been deleted." Let me try
>> again...
>>
>> This page
>> https://tesseract-ocr.github.io/tessdoc/tess4/Make-Box-Files.html
>> says "lstmbox - Generated by tesseract using lstmbox config from image
>> files - each char uses coordinates of its entire line. This format is also
>> generated by the tesstrain makefile."
>>
>> Tom
>>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/b0c4e374-2f79-486f-acb4-acf686119ba2n%40googlegroups.com
> 
> .
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAJbzG8yMv14YDKcP%2BgzRsrn6iUXiTFVEiD_-q9YtccNGmbuNKA%40mail.gmail.com.

Re: [tesseract-ocr] Remove the thin horizontal line

2024-09-06 Thread Zdenko Podobny

have a look at http://www.leptonica.org/line-removal.html
The source code is here:
https://github.com/DanBloomberg/leptonica/blob/master/prog/lineremoval_reg.c
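
To make the general idea concrete, here is a much simpler run-length sketch in plain Python. It is not the Leptonica procedure from the linked page, and every name and threshold in it (remove_horizontal_lines, min_len, max_thickness) is an assumption chosen for illustration. It works on a binary image given as a list of rows with 1 for ink, clears long horizontal runs, and keeps pixels where the ink is taller than a thin line, so strokes crossing the line are not cut.

```python
def vertical_run(img, x, y):
    """Height of the connected vertical ink run through pixel (x, y)."""
    top = y
    while top > 0 and img[top - 1][x]:
        top -= 1
    bottom = y
    while bottom < len(img) - 1 and img[bottom + 1][x]:
        bottom += 1
    return bottom - top + 1


def remove_horizontal_lines(img, min_len=40, max_thickness=3):
    """Return a copy of img with thin, long horizontal ink runs cleared."""
    height = len(img)
    width = len(img[0]) if height else 0
    out = [row[:] for row in img]
    for y in range(height):
        x = 0
        while x < width:
            if not img[y][x]:
                x += 1
                continue
            start = x
            while x < width and img[y][x]:
                x += 1
            if x - start >= min_len:
                for cx in range(start, x):
                    # Keep pixels that belong to something taller than a thin line.
                    if vertical_run(img, cx, y) <= max_thickness:
                        out[y][cx] = 0
    return out
```

A real receipt scan would first need binarization and probably deskewing, which is what the Leptonica example takes care of.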

Zdenko


On Fri, 6 Sep 2024 at 11:08, Sundar Andaperumal wrote:

> Hi,
>
>  I am trying to remove the thin horizontal line; when doing so the text in
> the SUBTOTAL
> gets disturbed and gives special characters like this:  (`°`, `—`, `~`,
> `*`, etc.)
>
>  How can I ignore or remove this horizontal line and extract the proper text
> in the SUBTOTAL section? Image attached.
>
> thanks!
>

Re: [tesseract-ocr] Re: Tesseract training ground truth: I'm confused about the box files

2024-09-05 Thread Zdenko Podobny

What about reading tesstrain Readme and using the example data to
understand the training process better?

Zdenko


On Thu, 5 Sep 2024 at 17:41, 'Danny' via tesseract-ocr <
tesseract-ocr@googlegroups.com> wrote:

> Hi Zdenko,
> Thanks for the response.  However, ocrd-testset.zip contains training
> images and ground truth text without boxes.
>
> True, the images contain a full line of text:
> [image: alexis_ruhe01_1852_0099_012.png]
>
> But there are no box files in the training set.
>
> I'd like to confirm whether the LSTM training set's xxx.box file is expected
> to contain one box per line (wrapping the entire line) or one box per
> character in the line...  Any insight?
>
> On Thursday, September 5, 2024 at 9:15:12 PM UTC+8 zdenop wrote:
>
>> have a look at provided example  ocrd-testset.zip
>> 
>>
>> Zdenko
>>
>>
>> On Tue, 3 Sep 2024 at 16:04, 'Danny' via tesseract-ocr <
>> tesser...@googlegroups.com> wrote:
>>
>>> @zdenop wrote:
>>> | Tesseract LSTM engine (tesseract >=v4) training script is based on
>>> lines (group of words)
>>> | Box files reflect that. And yes - box files are important.
>>>
>>> Zdenko, does this mean a "box file" for LSTM training should wrap the
>>> entire text line and NOT the individual characters?
>>> Which is correct for LSTM training:
>>>
>>> A) individual boxes like this, or
>>> [image: sub_2.png]
>>> B) One box for entire line:
>>> [image: sub_2 line.png]
>>> Thanks.
>>>
>>> On Sunday, July 14, 2024 at 9:05:48 PM UTC+8 zdenop wrote:
>>>
>>>> Ehm:
>>>>
>>>> 1. Tesseract v3 (legacy) engine training is based on characters.
>>>> 2. Tesseract LSTM engine (tesseract >=v4) training script is based
>>>> on lines (group of words)
>>>>
>>>> Box files reflect that. And yes - box files are important.
>>>>
>>>>
>>>> Zdenko
>>>>
>>>>
>>>> On Fri, 12 Jul 2024 at 14:14, Mateusz Matela wrote:
>>>>
>>>>> As an experiment, I ran the training on a small sample produced with
>>>>> text2image. Then I converted the .box files so that each character is
>>>>> assigned the common bounding rectangle of all the characters and ran the
>>>>> training again. The outputs were identical in both cases. Then I removed
>>>>> the box file and let the training script autogenerate them. In that case
>>>>> the reported error rates were crazy, like 99% instead of 0.5%.
>>>>> This suggests that conclusion 3 is correct.
>>>>>
>>>>> Wednesday, 10 July 2024 at 15:17:07 UTC+2 Mateusz Matela wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> Sorry if double posting, my previous message didn't appear and I
>>>>>> don't see any info about waiting for acceptance or something.
>>>>>> I was searching for this topic in this forum and it was mentioned a
>>>>>> few times, but I couldn't find a clear and definitive explanation.
>>>>>>
>>>>>> How does the information put in the .box files affect the training
>>>>>> process? The file contains coordinates for each character in the txt
>>>>>> file, but the documentation says that since Tesseract 4.0 the model
>>>>>> operates on the level of whole lines. Some tools like text2image
>>>>>> generate the .box files with accurate coordinates for each character.
>>>>>> When the .box files are missing the tesstrain Makefile generates them
>>>>>> using generate_line_box.py, which assigns the same full image area to
>>>>>> each character.
>>>>>>
>>>>>> I see 3 possible conclusions, which one is closest to the truth?
>>>>>>
>>>>>> 1. The .box files do not affect the LSTM training at all and are just
>>>>>> a leftover from the times of Tesseract 3. In that case, ideally in the
>>>>>> future they could be completely dropped or only required/generated when
>>>>>> specifically working with the legacy engine.
>>>>>>
>>>>>> 2. There is still a chance that training will work better with exact
>>>>>> coordinates and the generate_line_box.py is just a cheap workaround that
>>>>>> could be improved on in the future.
>>>>>>
>>>>>> 3. The .box file is still important in case you prefer to define the
>>>>>> coordinates for the text in the image instead of cropping the image. The
>>>>>> granularity of the coordinates is not important as Tesseract will just
>>>>>> work on a box that encapsulates all of the character boxes. Even if
>>>>>> confusing, this approach is still better than having different .box
>>>>>> file formats for LSTM and the legacy engine.
>>>>>>
>>>>>> I'll be grateful for any wisdom on this.
>>>>>>
>>>>>> Thanks
>>>>>> Mateusz
>>>>>>

Re: [tesseract-ocr] Tesseract 5 with dnf

2024-09-05 Thread Zdenko Podobny

No. We do not distribute binary packages. Volunteers create and maintain
them.

Zdenko


On Thu, 5 Sep 2024 at 20:56, Chris Crutts (agentc313) wrote:

> on my Oracle Linux 8.10 distribution, doing
>
> $ sudo dnf install tesseract
>
> installs tesseract version 4.1.1-2.el8 and leptonica version 1.76.0-2.el8
>
> As of today, 9/5/2024, the newest version is Release 5.4.1
> (tesseract-ocr/tesseract on github.com).
>
> I am curious as to why the newest version that can be installed via dnf is
> Release 4.1.1, which was released in late 2019.
>
> I found that you can install from source, or by using the Snap Store,
> but I want to use dnf.
>
> Are there any plans to update the dnf package in the future?
>

Re: [tesseract-ocr] Re: Tesseract training ground truth: I'm confused about the box files

2024-09-05 Thread Zdenko Podobny

have a look at provided example  ocrd-testset.zip

Zdenko

On Tue, 3 Sep 2024 at 16:04, 'Danny' via tesseract-ocr <
tesseract-ocr@googlegroups.com> wrote:

> @zdenop wrote:
> | Tesseract LSTM engine (tesseract >=v4) training script is based on lines
> (group of words)
> | Box files reflect that. And yes - box files are important.
>
> Zdenko, does this mean a "box file" for LSTM training should wrap the
> entire text line and NOT the individual characters?
> Which is correct for LSTM training:
>
> A) individual boxes like this, or
> [image: sub_2.png]
> B) One box for entire line:
> [image: sub_2 line.png]
> Thanks.
>
> On Sunday, July 14, 2024 at 9:05:48 PM UTC+8 zdenop wrote:
>
>> Ehm:
>>
>>1. Tesseract v3 (legacy) engine training is based on characters.
>>2. Tesseract LSTM engine (tesseract >=v4) training script is based on
>>lines (group of words)
>>
>> Box files reflect that. And yes - box files are important.
>>
>>
>> Zdenko
>>
>>
>> On Fri, 12 Jul 2024 at 14:14, Mateusz Matela wrote:
>>
>>> As an experiment, I ran the training on a small sample produced with
>>> text2image. Then I converted the .box files so that each character is
>>> assigned the common bounding rectangle of all the characters and ran the
>>> training again. The outputs were identical in both cases. Then I removed
>>> the box file and let the training script autogenerate them. In that case
>>> the reported error rates were crazy, like 99% instead of 0.5%.
>>> This suggests that conclusion 3 is correct.
>>>
>>> Wednesday, 10 July 2024 at 15:17:07 UTC+2 Mateusz Matela wrote:
>>>
>>>> Hi all,
>>>>
>>>> Sorry if double posting, my previous message didn't appear and I don't
>>>> see any info about waiting for acceptance or something.
>>>> I was searching for this topic in this forum and it was mentioned a few
>>>> times, but I couldn't find a clear and definitive explanation.
>>>>
>>>> How does the information put in the .box files affect the training
>>>> process? The file contains coordinates for each character in the txt file,
>>>> but the documentation says that since Tesseract 4.0 the model operates on
>>>> the level of whole lines. Some tools like text2image generate the .box
>>>> files with accurate coordinates for each character. When the .box files are
>>>> missing the tesstrain Makefile generates them using generate_line_box.py,
>>>> which assigns the same full image area to each character.
>>>>
>>>> I see 3 possible conclusions, which one is closest to the truth?
>>>>
>>>> 1. The .box files do not affect the LSTM training at all and are just a
>>>> leftover from the times of Tesseract 3. In that case, ideally in the future
>>>> they could be completely dropped or only required/generated when
>>>> specifically working with the legacy engine.
>>>>
>>>> 2. There is still a chance that training will work better with exact
>>>> coordinates and the generate_line_box.py is just a cheap workaround that
>>>> could be improved on in the future.
>>>>
>>>> 3. The .box file is still important in case you prefer to define the
>>>> coordinates for the text in the image instead of cropping the image. The
>>>> granularity of the coordinates is not important as Tesseract will just work
>>>> on a box that encapsulates all of the character boxes. Even if confusing,
>>>> this approach is still better than having different .box file formats for
>>>> LSTM and the legacy engine.
>>>>
>>>> I'll be grateful for any wisdom on this.
>>>>
>>>> Thanks
>>>> Mateusz


Re: [tesseract-ocr] Issue with Tesseract OCR: Difficulty Detecting White Text on Blue Background

2024-08-22 Thread Zdenko Podobny

Tesseract is an OCR engine, not a text detection tool.
If you pass just the blue button to tesseract, it has no problem extracting
the text:

tesseract blue_button.png -
Sign in
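
For cases where a crop still does not read well, the advice quoted further down in this thread (greyscale, black text on white background) can be sketched in plain Python. This is only an illustration under stated assumptions: the image is a list of rows of (r, g, b) tuples, the luminance weights are the common BT.601 ones, the median-below-128 test for "light text on dark background" is a heuristic chosen here, and normalize_polarity is a hypothetical name.

```python
def to_gray(pixel):
    """Luminance of an (r, g, b) pixel using BT.601 weights."""
    r, g, b = pixel
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def normalize_polarity(rgb_rows):
    """Return a greyscale image with dark text on a light background."""
    gray = [[to_gray(p) for p in row] for row in rgb_rows]
    flat = sorted(v for row in gray for v in row)
    median = flat[len(flat) // 2]
    # Most pixels are background; a dark median means light-on-dark, so invert.
    if median < 128:
        gray = [[255 - v for v in row] for row in gray]
    return gray


if __name__ == "__main__":
    blue, white = (0, 90, 200), (255, 255, 255)  # hypothetical button colours
    crop = [[blue, white, blue], [blue, blue, blue]]
    print(normalize_polarity(crop))
```

The crop itself (finding where the button is) is the text detection step that tesseract does not do for you.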


Zdenko


On Thu, 22 Aug 2024 at 9:11, Abdul Kalam Shaik wrote:

> Thanks Ger for your response. My use case is that whenever there is a
> colored background I'm unable to detect the text. I have attached a few
> cases where I had difficulty detecting the text.
>
> Regards,
>
> Shaik Abdul Kalam.
>
> On Tuesday, August 20, 2024 at 4:13:42 PM UTC+5:30 ger.h...@gmail.com
> wrote:
>
>> Generally, it is best to convert to greyscale with black text on white
>> background. Seems you tried that so questions remain.
>> Please include one or two sample images which exhibits your problem, so
>> folks around here have something to test against.
>>
>> Ciao,
>>
>> Ger
>>
>> On Mon, 19 Aug 2024, 18:45 Abdul Kalam Shaik, 
>> wrote:
>>
>>> Hello,
>>>
>>> I am encountering an issue with Tesseract OCR when trying to detect
>>> white text on a blue background. Despite various preprocessing techniques,
>>> the OCR is not accurately recognizing the text on this specific background.
>>>
>>> *Details:*
>>>
>>> Tesseract Version: tesseract v5.0.0-alpha.20210506
>>> Language Pack: English
>>> *Image Characteristics:*
>>> Background color: Blue
>>> Text color: White
>>> Image resolution: 1920X1080P
>>> Image format:PNG
>>> *Preprocessing Techniques Applied:*
>>> 1. Grayscale conversion
>>> 2. Contrast adjustment
>>> 3. Binary thresholding
>>> 4. Inversion of the image
>>> 5. Morphological operations
>>> 6. Increase Contrast
>>> 7. ROI
>>> 8. Convert the image to the HSV color space, Create a mask to isolate
>>> blue regions,Invert the mask to focus on the text and Using the mask to
>>> extract the white text
>>> *  Script/Code Used:*
>>> import cv2
>>> import pytesseract
>>> import pyautogui
>>> import time
>>> import numpy as np
>>>
>>> # Specify the path to the Tesseract executable if not in PATH
>>> pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
>>>
>>>
>>> def preprocess_image_gray(image):
>>>     # Convert to grayscale
>>>     gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
>>>     cv2.imshow("Gray Scale Image", gray)
>>>     cv2.waitKey(0)
>>>     cv2.destroyAllWindows()
>>>     return gray
>>>
>>>
>>> def preprocess_image_increase_contrast(image):
>>>     # Increase contrast
>>>     contrast = cv2.convertScaleAbs(image, alpha=1.5, beta=0)
>>>     cv2.imshow("Increase contrast", contrast)
>>>     cv2.waitKey(0)
>>>     cv2.destroyAllWindows()
>>>     return contrast
>>>
>>>
>>> def preprocess_image_gaussian_blur(image):
>>>     # Apply Gaussian blur
>>>     blurred = cv2.GaussianBlur(image, (5, 5), 0)
>>>     cv2.imshow("GaussianBlur", blurred)
>>>     cv2.waitKey(0)
>>>     cv2.destroyAllWindows()
>>>     return blurred
>>>
>>>
>>> def preprocess_image_edge_detection(image):
>>>     # Perform edge detection
>>>     edged = cv2.Canny(image, 50, 150)
>>>     cv2.imshow("edge detection", edged)
>>>     cv2.waitKey(0)
>>>     cv2.destroyAllWindows()
>>>     return edged
>>>
>>>
>>> def preprocess_image_inverted(image):
>>>     # Invert the image
>>>     inverted_image = cv2.bitwise_not(image)
>>>     cv2.imshow("Inverted Image", inverted_image)
>>>     cv2.waitKey(0)
>>>     cv2.destroyAllWindows()
>>>
>>>     return inverted_image
>>>
>>>
>>> def preprocess_image_dialte_edges(image):
>>>     # Dilate the edges
>>>     dilated = cv2.dilate(image, None, iterations=2)
>>>     cv2.imshow("dilate", dilated)
>>>     cv2.waitKey(0)
>>>     cv2.destroyAllWindows()
>>>
>>>     # Bitwise-AND mask and original image
>>>     result = cv2.bitwise_and(image, image, mask=dilated)
>>>     cv2.imshow("Bitwise-AND mask and original image", result)
>>>     cv2.waitKey(0)
>>>     cv2.destroyAllWindows()
>>>
>>>     # Invert the image
>>>     inverted_image = cv2.bitwise_not(result)
>>>     cv2.imshow("Inverted Image", inverted_image)
>>>     cv2.waitKey(0)
>>>     cv2.destroyAllWindows()
>>>
>>>     return inverted_image
>>>
>>>
>>> def perform_ocr(image_path, text_to_find=None, config="--psm 6 --oem 3",
>>>                 preprocess_func=preprocess_image_gray):
>>>     global ocr_results
>>>     try:
>>>         image = cv2.imread(image_path)
>>>         image_preprocessed = preprocess_func(image)
>>>         image_rgb = cv2.cvtColor(image_preprocessed, cv2.COLOR_BGR2RGB)
>>>         ocr_data = pytesseract.image_to_data(image_rgb,
>>>                                              output_type=pytesseract.Output.DICT, config=config)
>>>
>>>         if text_to_find is not None and not isinstance(text_to_find, list):
>>>             text_to_find = [text_to_find]
>>>
>>>         ocr_results = []
>>>         for i in range(len(ocr_data['text'])):
>>>             text = ocr_data['text'][i].strip()
>>>             if not text:
>>>                 continue
>>>
>>>             confidence = float(ocr_data['conf'][i]) / 100.0  # Convert confidence to decimal
>>>             if con

Re: [tesseract-ocr] Converting colored background and colored characters to text with the Tesseract library

2024-08-04 Thread Zdenko Podobny

Captcha was created to fool OCR.


Zdenko


On Mon, 5 Aug 2024 at 7:27, Emre Batu wrote:

> [image: 20240804211345.png]  Hello everyone. I am using the Tesseract
> library in a C# application to analyze images. However, the image I want to
> convert to text contains colored characters and a colored background. As a
> result, the output is not accurate. How can I convert this image to text
> correctly? Thank you.
>

Re: [tesseract-ocr] Re: How to prevern Tesseract from interpreting noise as characters

2024-08-04 Thread Zdenko Podobny

tesseract unnamed.jpg -
Estimating resolution as 182

i.e. no words were recognized, so the problem could be in the parameters you
used for OCR.

Before OCR I suggest image preprocessing and maybe the detection of empty
pages.
Have a look at the leptonica example that normalizes for uneven illumination
(pixBackgroundNorm in
https://github.com/DanBloomberg/leptonica/blob/master/prog/livre_adapt.c)
and then binarize the image.
I think with some more "aggressive" parameters you can get a clean empty
page, so you will not need to modify your OCR parameters.

Zdenko


On Sun, 4 Aug 2024 at 13:22, Iain Downs wrote:

> In the event that anyone else has a similar issue, this is how I
> approached it.
>
> Firstly, make a histogram of the number of pixels with each intensity (so
> an array of 256 numbers).
>
> When you inspect this you get results like the below.
>
> [image: Finding empty pages.png]
>
> This is after a little smoothing and taking the log of the values.
>
> You can see that the properly blank pages show little or no very dark
> (black) pixels, whereas the pages with some text, even if a small amount
> have a fair number.
>
> I simply set a cutoff level (in this case 1) and a cutoff intensity (in my
> case 80): provided the first peak of 1 in the log smoothed intensity
> histogram is below 80, the page has text; otherwise it is blank.
>
> You can also see the problem which tesseract has (with default
> binarisation) in that the intensity is distinctly bimodal.  I think this is
> due to bleedthrough from the reverse of the page.  Of course that is
> essentially what OTSU uses to pick out 'black' from 'white'.
>
> Iain
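
The blank-page check described above can be sketched roughly as follows. This is one reading of the description, not Iain's actual code: the smoothing is a simple moving average with an assumed window of 5, the log is log10(count + 1), and "first peak of 1" is interpreted as the lowest intensity whose log-smoothed count reaches the cutoff level. The function name looks_blank is hypothetical; the cutoff values 1 and 80 are the ones given in the message.

```python
import math


def looks_blank(pixels, cutoff_level=1.0, cutoff_intensity=80, window=5):
    """Guess whether a greyscale page (iterable of 0..255 values) is blank."""
    # 1. Histogram of pixel counts per intensity.
    hist = [0] * 256
    for value in pixels:
        hist[value] += 1

    # 2. Light smoothing with a moving average, then log of the counts.
    half = window // 2
    log_smooth = []
    for i in range(256):
        lo, hi = max(0, i - half), min(255, i + half)
        mean = sum(hist[lo:hi + 1]) / (hi - lo + 1)
        log_smooth.append(math.log10(mean + 1))

    # 3. Lowest intensity whose log-smoothed count reaches the cutoff level.
    first = next((i for i, v in enumerate(log_smooth) if v >= cutoff_level), None)

    # Enough dark pixels below the cutoff intensity means there is text.
    return first is None or first >= cutoff_intensity
```

Both cutoffs depend on the scans, so they would need tuning against a few known blank and non-blank pages.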
> On Tuesday, July 16, 2024 at 5:38:02 PM UTC+1 Iain Downs wrote:
>
>> I'm working on processing scanned paperback books with tesseract (C++ API
>> at the moment).  One issue I've found is that when a page has little or no
>> text tesseract gets overkeen and interprets the noise as text.
>>
>> The image below is the raw page.  In this case it's the inside front
>> cover of a book.
>> [image: HookRawPage.jpg]
>> This is the image after tesseract has processed it (binarization) and
>> before the character recognition.
>> [image: HookPostProcessed.jpg]
>>
>> tesseract suggests that there are 160 or so words (by some definition of
>> word!) on this page as per the attached (Hook02Small.txt).
>>
>> This also happens on pages which DO contain text but a small amount.  I
>> suspect that the binarization (possibly OTSU?) is to blame.  I can probably
>> do something to detect entirely blank pages, but I am less sure what to do
>> with mainly blank pages.
>>
>> Any suggestions most welcome!
>>
>> Iain
>>
>>

Re: [tesseract-ocr] A few characters being misrecognized

2024-07-26 Thread Zdenko Podobny

tesseract img1.png - --psm 6  -l fra
Juccsus

tesseract img2.png - --psm 6  -l fra
Bladë

Zdenko


On Fri, 19 Jul 2024 at 5:12, Péter Györök wrote:

> I'm using this command:
> tesseract file.png - --psm 6 -l script/Latin
>
> img1.png returns "JUCcCcsus" instead of "Juccsus".
> img2.png returns "Bladé" instead of "Bladë".
>
> Any suggestions on how to fix these?
>

Re: [tesseract-ocr] Tessarct won't recognise single characters

2024-07-15 Thread Zdenko Podobny

The code that was posted here is not dangerous. It is just that a Python coder would write
it the right way.
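
As a sketch of what a tidier version might look like: the string built in the posted script joins 'eng' and '--psm' without a separating space, and the concatenation serves no purpose. The example below passes the language through the lang argument of pytesseract.image_to_string and keeps only the page segmentation mode in config. The helper names build_config and ocr are hypothetical.

```python
def build_config(psm: int = 6) -> str:
    """Return the tesseract CLI options as one plain string."""
    return f"--psm {psm}"


# The string from the posted script, reproduced to show the problem:
original = r' -l  ' + 'eng' + '--psm 6  '
# 'eng' and '--psm' end up glued together as 'eng--psm'.


def ocr(image, lang: str = "eng", psm: int = 6) -> str:
    """Run OCR on an image (file path, PIL image, or numpy array)."""
    # Imported lazily so this sketch loads even without pytesseract installed.
    import pytesseract

    return pytesseract.image_to_string(image, lang=lang, config=build_config(psm))


if __name__ == "__main__":
    print(repr(original))
    print(repr(build_config()))
```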


Zdenko


On Mon, 15 Jul 2024 at 16:09, Mona Dastar wrote:

> Hi everyone
> Regarding what Zdenko said, after the first section of module 3 I stopped
> because I had questions and I couldn’t understand the code, I have trouble
> with the last module what do you think?
> Since that I didn’t study and I am getting farther and further away.
> I appreciate your tips.
>
>
> On Mon, 15 Jul 2024 at 10:03 Zdenko Podobny  wrote:
>
>> My remark is about code quality. Code quality is relevant. Or indication
>> that somebody is doing copy&paste without understanding code - that is
>> dangerous.
>>
>>
>> Zdenko
>>
>>
>> On Mon, 15 Jul 2024 at 12:30, René JM Clais wrote:
>>
>>> My code is working well and your remarks are out of the context.
>>>
>>> On Sun, 14 Jul 2024 at 19:47, Zdenko Podobny wrote:
>>>
>>>> So you do not understand the code you posted?
>>>>
>>>> Zdenko
>>>>
>>>>
>>>> On Sun, 14 Jul 2024 at 19:44, René JM Clais wrote:
>>>>
>>>>> I don't understand what do you mean ?
>>>>>
>>>>> On Sun, 14 Jul 2024 at 16:13, Zdenko Podobny wrote:
>>>>>
>>>>>> custom_config = r' -l  ' + 'eng' + '--psm 6
>>>>>>
>>>>>>
>>>>>> What is the point of this? To slow down the script?
>>>>>>
>>>>>> Zdenko
>>>>>>
>>>>>>
>>>>>> On Sun, 14 Jul 2024 at 15:56, René JM Clais wrote:
>>>>>>
>>>>>>> import cv2
>>>>>>> import pytesseract as tesser
>>>>>>>
>>>>>>>
>>>>>>> originalImage = cv2.imread("myfile.jpg")  # myfile.jpg ===> original image
>>>>>>>
>>>>>>> (thresh, imgbw) = cv2.threshold(originalImage, 180, 255, cv2.THRESH_BINARY)  # black and white
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> cv2.imshow('Black white image', imgbw)
>>>>>>> cv2.waitKey(0)  #make enter
>>>>>>> cv2.destroyAllWindows()
>>>>>>>
>>>>>>> #tesseract transformation
>>>>>>> #
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> custom_config = r' -l  ' + 'eng' + '--psm 6  '
>>>>>>>
>>>>>>>
>>>>>>> text= tesser.image_to_string(imgbw,config=custom_config )
>>>>>>> print(text)  #the text
>>>>>>>
>>>>>>>
>>>>>>> On Sat, 13 Jul 2024 at 18:26, Iain Downs wrote:
>>>>>>>
>>>>>>>> Can you give me some example code?  I'm currently trying to get
>>>>>>>> tesseract working for C++ in Visual Studio and it's a bit of a nightmare.
>>>>>>>> Python seems easier, though it's not one of my main languages - I can
>>>>>>>> try it out though!
>>>>>>>>
>>>>>>>> Iain
>>>>>>>>
>>>>>>>> On Saturday, July 13, 2024 at 11:20:54 AM UTC+1 renec...@gmail.com
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>> I try your example with tesseract for python - it works well
>>>>>>>>>
>>>>>>>>> On Thu, 11 Jul 2024 at 20:35, Iain Downs wrote:
>>>>>>>>>
>>>>>>>>>> I'm trying to extract page numbers from scanned pages of text.
>>>>>>>>>> Page numbers are either at the top or at the bottom, sometimes with
>>>>>>>>>> titles / authors / chapters.  Occasionally elsewhere, but I don't care
>>>>>>>>>> about the exceptions.
>>>>>>>>>>

Re: [tesseract-ocr] Tessarct won't recognise single characters

2024-07-15 Thread Zdenko Podobny

My remark is about code quality, and code quality is relevant. It can also be
an indication that somebody is copying and pasting code without understanding
it, and that is dangerous.


Zdenko


On Mon, 15 Jul 2024 at 12:30, René JM Clais wrote:

> My code is working well and your remarks are out of the context.
>
> On Sun, 14 Jul 2024 at 19:47, Zdenko Podobny wrote:
>
>> So you do not understand the code you posted?
>>
>> Zdenko
>>
>>
>> On Sun, 14 Jul 2024 at 19:44, René JM Clais wrote:
>>
>>> I don't understand what do you mean ?
>>>
>>> On Sun, 14 Jul 2024 at 16:13, Zdenko Podobny wrote:
>>>
>>>> custom_config = r' -l  ' + 'eng' + '--psm 6
>>>>
>>>>
>>>> What is the point of this? To slow down the script?
>>>>
>>>> Zdenko
>>>>
>>>>
>>>> On Sun, 14 Jul 2024 at 15:56, René JM Clais wrote:
>>>>
>>>>> import cv2
>>>>> import pytesseract as tesser
>>>>>
>>>>>
>>>>> originalImage = cv2.imread("myfile.jpg")  # myfile.jpg ===> original image
>>>>>
>>>>> (thresh, imgbw) = cv2.threshold(originalImage, 180, 255, cv2.THRESH_BINARY)  # black and white
>>>>>
>>>>>
>>>>>
>>>>> cv2.imshow('Black white image', imgbw)
>>>>> cv2.waitKey(0)  #make enter
>>>>> cv2.destroyAllWindows()
>>>>>
>>>>> #tesseract transformation
>>>>> #
>>>>>
>>>>>
>>>>>
>>>>> custom_config = r' -l  ' + 'eng' + '--psm 6  '
>>>>>
>>>>>
>>>>> text= tesser.image_to_string(imgbw,config=custom_config )
>>>>> print(text)  #the text
>>>>>
>>>>>
>>>>> On Sat, 13 Jul 2024 at 18:26, Iain Downs wrote:
>>>>>
>>>>>> Can you give me some example code?  I'm currently trying to get
>>>>>> tesseract working for C++ in Visual Studio and it's a bit of a nightmare.
>>>>>> Python seems easier, though it's not one of my main languages - I can
>>>>>> try it out though!
>>>>>>
>>>>>> Iain
>>>>>>
>>>>>> On Saturday, July 13, 2024 at 11:20:54 AM UTC+1 renec...@gmail.com
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>> I try your example with tesseract for python - it works well
>>>>>>>
>>>>>>> On Thu, 11 Jul 2024 at 20:35, Iain Downs wrote:
>>>>>>>
>>>>>>>> I'm trying to extract page numbers from scanned pages of text.
>>>>>>>> Page numbers are either at the top or at the bottom, sometimes with
>>>>>>>> titles / authors / chapters.  Occasionally elsewhere, but I don't care
>>>>>>>> about the exceptions.
>>>>>>>>
>>>>>>>> I've loaded tesseract 5.4 (windows) and run some tests using the
>>>>>>>> executable.  I'm finding that if the page number is a single digit on
>>>>>>>> the line, tesseract ignores it (but otherwise does a fantastic job of
>>>>>>>> OCR even with skewed and noisy images).
>>>>>>>>
>>>>>>>> I've isolated the single line used that as input and tesseract
>>>>>>>> tells me 'the page is empty'.
>>>>>>>>
>>>>>>>> Here is a sample of a single line with a '1' in it resolution is
>>>>>>>> 300dpi.
>>>>>>>> [image: 101_bottom.jpg]
>>>>>>>>
>>>>>>>> Ultimately I would be writing a program using tesseract, but in the
>>>>>>>> first instance I'd like to see it work with the exe.
>>>>>>>>
>>>>>>>> So, can I tell tesseract to be less fussy with individual
>>>>>>>> characters and if not how would I do so programatically - if possible?

Re: [tesseract-ocr] Tessarct won't recognise single characters

2024-07-14 Thread Zdenko Podobny

So you do not understand the code you posted?

Zdenko


On Sun, 14 Jul 2024 at 19:44, René JM Clais wrote:

> I don't understand what you mean.
>
> On Sun, 14 Jul 2024 at 16:13, Zdenko Podobny wrote:
>
>> custom_config = r' -l  ' + 'eng' + '--psm 6
>>
>>
>> What is the point of this? To slow down the script?
>>
>> Zdenko
>>
>>
>> On Sun, 14 Jul 2024 at 15:56, René JM Clais wrote:
>>
>>> import cv2
>>> import pytesseract as tesser
>>>
>>> # Load the original image (myfile.jpg).
>>> originalImage = cv2.imread("myfile.jpg")
>>>
>>> # Binarize: values above 180 become white (255), the rest black.
>>> (thresh, imgbw) = cv2.threshold(originalImage, 180, 255, cv2.THRESH_BINARY)
>>>
>>> # Show the result; press any key to close the window.
>>> cv2.imshow('Black white image', imgbw)
>>> cv2.waitKey(0)
>>> cv2.destroyAllWindows()
>>>
>>> # Run Tesseract on the binarized image.
>>> # Note the space before --psm, so that it stays a separate option.
>>> custom_config = r' -l  ' + 'eng' + ' --psm 6  '
>>>
>>> text = tesser.image_to_string(imgbw, config=custom_config)
>>> print(text)  # the recognized text
>>>
>>>
>>> On Sat, 13 Jul 2024 at 18:26, Iain Downs wrote:
>>>
>>>> Can you give me some example code?  I'm currently trying to get
>>>> tesseract working for C++ in Visual Studio and it's a bit of a nightmare.
>>>> Python seems easier, though it's not one of my main languages - I can try
>>>> it out though!
>>>>
>>>> Iain
>>>>
>>>> On Saturday, July 13, 2024 at 11:20:54 AM UTC+1 renec...@gmail.com
>>>> wrote:
>>>>
>>>>> Hi,
>>>>> I tried your example with Tesseract for Python - it works well.
>>>>>
>>>>> On Thu, 11 Jul 2024 at 20:35, Iain Downs wrote:
>>>>>
>>>>>> I'm trying to extract page numbers from scanned pages of text.  Page
>>>>>> numbers are either at the top or at the bottom - sometimes with titles /
>>>>>> authors / chapters.  Occasionally elsewhere, but I don't care about the
>>>>>> exceptions.
>>>>>>
>>>>>> I've loaded tesseract 5.4 (windows) and run some tests using the
>>>>>> executable.  I'm finding that if the page number is a single digit on the
>>>>>> line, tesseract ignores it (but otherwise does a fantastic job of OCR
>>>>>> even with skewed and noisy images).
>>>>>>
>>>>>> I've isolated the single line and used that as input, and tesseract
>>>>>> tells me 'the page is empty'.
>>>>>>
>>>>>> Here is a sample of a single line with a '1' in it; the resolution is
>>>>>> 300 dpi.
>>>>>> [image: 101_bottom.jpg]
>>>>>>
>>>>>> Ultimately I would be writing a program using tesseract, but in the
>>>>>> first instance I'd like to see it work with the exe.
>>>>>>
>>>>>> So, can I tell tesseract to be less fussy with individual characters
>>>>>> and, if not, how would I do so programmatically - if possible?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> Iain
>>>>>>
>>>>>> --
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "tesseract-ocr" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>> send an email to tesseract-oc...@googlegroups.com.
>>>>>> To view this discussion on the web visit
>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/c42d435c-4db5-48b5-94d3-5b761d340731n%40googlegroups.com
>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/c42d435c-4db5-48b5-94d3-5b761d340731n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>> .
>>>>>>

Re: [tesseract-ocr] Tessarct won't recognise single characters

2024-07-14 Thread Zdenko Podobny

>
> custom_config = r' -l  ' + 'eng' + '--psm 6


What is the point of this? To slow down the script?
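
A side note on the quoted line: all of its pieces are constants, so the concatenation rebuilds the same string on every run, and a single literal says the same thing more clearly. The sketch below is only an illustration. It uses Python's standard `shlex.split` as a stand-in for how a wrapper might tokenize the option string (that pytesseract tokenizes `config` this way is an assumption here), and the helper name `tokens` is hypothetical. It also shows why a space is needed between `eng` and `--psm`.

```python
import shlex


def tokens(config: str) -> list[str]:
    """Split an option string into command-line tokens (extra spaces are ignored)."""
    return shlex.split(config)


# Concatenation with a separating space before --psm ...
concatenated = r' -l  ' + 'eng' + ' --psm 6  '
# ... and the equivalent single literal.
literal = "-l eng --psm 6"

# Without the separating space, 'eng' and '--psm' fuse into one token,
# which would be read as a language name rather than an option.
fused = r' -l  ' + 'eng' + '--psm 6  '

print(tokens(concatenated))
print(tokens(literal))
print(tokens(fused))
```

The first two lines of output are identical, which is the argument for writing the literal directly.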

Zdenko


On Sun, 14 Jul 2024 at 15:56, René JM Clais wrote:

> import cv2
> import pytesseract as tesser
>
> # Load the original image (myfile.jpg).
> originalImage = cv2.imread("myfile.jpg")
>
> # Binarize: values above 180 become white (255), the rest black.
> (thresh, imgbw) = cv2.threshold(originalImage, 180, 255, cv2.THRESH_BINARY)
>
> # Show the result; press any key to close the window.
> cv2.imshow('Black white image', imgbw)
> cv2.waitKey(0)
> cv2.destroyAllWindows()
>
> # Run Tesseract on the binarized image.
> # Note the space before --psm, so that it stays a separate option.
> custom_config = r' -l  ' + 'eng' + ' --psm 6  '
>
> text = tesser.image_to_string(imgbw, config=custom_config)
> print(text)  # the recognized text
>
>
> On Sat, 13 Jul 2024 at 18:26, Iain Downs wrote:
>
>> Can you give me some example code?  I'm currently trying to get tesseract
>> working for C++ in Visual Studio and it's a bit of a nightmare.  Python
>> seems easier, though it's not one of my main languages - I can try it out
>> though!
>>
>> Iain
>>
>> On Saturday, July 13, 2024 at 11:20:54 AM UTC+1 renec...@gmail.com wrote:
>>
>>> Hi,
>>> I tried your example with Tesseract for Python - it works well.
>>>
>>> On Thu, 11 Jul 2024 at 20:35, Iain Downs wrote:
>>>
>>>
>>>> I'm trying to extract page numbers from scanned pages of text.  Page
>>>> numbers are either at the top or at the bottom - sometimes with titles /
>>>> authors / chapters.  Occasionally elsewhere, but I don't care about the
>>>> exceptions.
>>>>
>>>> I've loaded tesseract 5.4 (windows) and run some tests using the
>>>> executable.  I'm finding that if the page number is a single digit on the
>>>> line, tesseract ignores it (but otherwise does a fantastic job of OCR even
>>>> with skewed and noisy images).
>>>>
>>>> I've isolated the single line and used that as input, and tesseract tells
>>>> me 'the page is empty'.
>>>>
>>>> Here is a sample of a single line with a '1' in it; the resolution is
>>>> 300 dpi.
>>>> [image: 101_bottom.jpg]
>>>>
>>>> Ultimately I would be writing a program using tesseract, but in the
>>>> first instance I'd like to see it work with the exe.
>>>>
>>>> So, can I tell tesseract to be less fussy with individual characters
>>>> and, if not, how would I do so programmatically - if possible?
>>>>
>>>> Thanks
>>>>
>>>> Iain
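
For the single-digit case it may be worth trying a page segmentation mode that assumes less layout: `--psm 7` (single text line), `--psm 10` (single character) or `--psm 13` (raw line), optionally with `-c tessedit_char_whitelist=0123456789`. Whether any of these recovers the '1' in this particular image is untested. In the sketch below the helper name `build_config` is hypothetical; it only assembles the option string, and the pytesseract call is left as a comment because it needs a local Tesseract installation.

```python
def build_config(psm: int, whitelist: str = "", lang: str = "eng") -> str:
    """Assemble a Tesseract option string for a small crop such as a page number."""
    parts = ["-l", lang, "--psm", str(psm)]
    if whitelist:
        # Restrict recognition to the given characters, e.g. digits only.
        parts += ["-c", f"tessedit_char_whitelist={whitelist}"]
    return " ".join(parts)


# Candidate modes: 7 = single text line, 10 = single character, 13 = raw line.
for psm in (7, 10, 13):
    print(build_config(psm, whitelist="0123456789"))

# Assumed usage with pytesseract (not run here):
#   import pytesseract
#   text = pytesseract.image_to_string(image, config=build_config(10, "0123456789"))
```

With the executable, the same idea would look something like `tesseract 101_bottom.jpg - --psm 10`.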



Re: [tesseract-ocr] Re: Tesseract training ground truth: I'm confused about the box files

2024-07-14 Thread Zdenko Podobny

Ehm:

   1. Tesseract v3 (legacy) engine training is based on characters.
   2. Tesseract LSTM engine (tesseract >= v4) training is based on
   lines (groups of words).

Box files reflect that. And yes - box files are important.
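
To make the two layouts concrete: each box-file line has the form `<symbol> <left> <bottom> <right> <top> <page>`, with the origin at the bottom-left of the image. The sketch below turns per-character boxes into the line-level form discussed in this thread, where every symbol carries the same rectangle enclosing the whole text. It is an illustration only, not the code of `generate_line_box.py` or of Tesseract itself; the function name `collapse_to_line_box` is hypothetical, and the input is assumed to be a non-empty list of boxes for one text line.

```python
def collapse_to_line_box(box_lines: list[str]) -> list[str]:
    """Give every symbol the bounding box that encloses all symbols."""
    parsed = []
    for line in box_lines:
        # Format: <symbol> <left> <bottom> <right> <top> <page>
        symbol, left, bottom, right, top, page = line.rsplit(" ", 5)
        parsed.append((symbol, int(left), int(bottom), int(right), int(top), page))

    # Union of all rectangles (origin is bottom-left, so top > bottom).
    left = min(p[1] for p in parsed)
    bottom = min(p[2] for p in parsed)
    right = max(p[3] for p in parsed)
    top = max(p[4] for p in parsed)

    return [f"{p[0]} {left} {bottom} {right} {top} {p[5]}" for p in parsed]


per_character = ["a 10 20 30 40 0", "b 32 18 50 42 0"]
for line in collapse_to_line_box(per_character):
    print(line)
```

The file still has one line per character; only the coordinates become identical.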


Zdenko


On Fri, 12 Jul 2024 at 14:14, Mateusz Matela wrote:

> As an experiment, I ran the training on a small sample produced with
> text2image. Then I converted the .box files so that each character is
> assigned the common bounding rectangle of all the characters and ran the
> training again. The outputs were identical in both cases. Then I removed
> the box files and let the training script autogenerate them. In that case
> the reported error rates were crazy, like 99% instead of 0.5%.
> This suggests that conclusion 3 is correct.
>
> On Wednesday, 10 July 2024 at 15:17:07 UTC+2, Mateusz Matela wrote:
>
>> Hi all,
>>
>> Sorry if double posting, my previous message didn't appear and I don't
>> see any info about waiting for acceptance or something.
>> I was searching for this topic in this forum and it was mentioned a few
>> times, but I couldn't find a clear and definitive explanation.
>>
>> How does the information put in the .box files affect the training
>> process? The file contains coordinates for each character in the txt file,
>> but the documentation says that since Tesseract 4.0 the model operates on
>> the level of whole lines. Some tools like text2image generate the .box
>> files with accurate coordinates for each character. When the .box files are
>> missing the tesstrain Makefile generates them using generate_line_box.py,
>> which assigns the same full image area to each character.
>>
>> I see 3 possible conclusions, which one is closest to the truth?
>>
>> 1. The .box files do not affect the LSTM training at all and are just a
>> leftover from the times of Tesseract 3. In that case, ideally in the future
>> they could be completely dropped or only required/generated when
>> specifically working with the legacy engine.
>>
>> 2. There is still a chance that training will work better with exact
>> coordinates and the generate_line_box.py is just a cheap workaround that
>> could be improved on in the future.
>>
>> 3. The .box file is still important in case you prefer to define the
>> coordinates for the text in the image instead of cropping the image. The
>> granularity of the coordinates is not important, as Tesseract will just
>> work on a box that encapsulates all of the character boxes. Even if
>> confusing, this approach is still better than having different .box file
>> formats for LSTM and the legacy engine.
>>
>> I'll be grateful for any wisdom on this.
>>
>> Thanks
>> Mateusz
>>


Re: [tesseract-ocr] Re: Text extraction failure after preprocessing.

2024-06-28 Thread Zdenko Podobny

As far as I remember, the traineddata file is from
https://github.com/arturaugusto/display_ocr/blob/master/letsgodigital/letsgodigital.traineddata
Also, check https://github.com/Shreeshrii/tessdata_ssd for Seven Segment
Display recognition.

Zdenko


On Fri, 28 Jun 2024 at 17:07, 'uday kaipa' via tesseract-ocr
<tesseract-ocr@googlegroups.com> wrote:

> Hi Zdenko,
>
>
> Thanks for your recommendation about the image format and the letsgodigital
> traineddata. Yes, you are right. I got the digits from a segment display.
> I will try the training process, but before that I wanted to try other
> options.
>
> I suppose you used lets.traineddata after renaming it. When I tried the
> same command with the same PSM on the PNG image, I got .4 instead.
> By the way, did you apply any processing to the image? The edges look
> slightly different.
>
> tesseract 14.png out -l lets --oem 0 --psm 7
> .4
>
> Thanks for your time.
>
> On Friday, June 28, 2024 at 3:31:15 PM UTC+2 zdenop wrote:
>
>> First of all, using jpg as a format for image processing and OCR is not
>> very smart.
>>
>> Next: it does not seem like a very standard font... maybe you will need
>> to train tesseract for it.
>> For me, it looks like a heavily preprocessed 7-segment font... so I tried
>> this:
>>
>> tesseract 14.png - --psm 7 --oem 0 -l letsgodigital
>> 14
>>
>> Zdenko
>>
>>
>> On Fri, 28 Jun 2024 at 14:09, 'uday kaipa' via tesseract-ocr
>> <tesser...@googlegroups.com> wrote:
>>
>>> I have resized the image so that the text height is around 30 px, and I
>>> have tried a 10 px border as recommended in some threads here.
>>> I converted the image to binary and tried all PSM modes.
>>> I am not sure why it is not OCR'ed properly.
>>>
>>> Any help is appreciated. :)
>>>
>>>
>>>
>>>
>>>
>>> On Thursday, June 27, 2024 at 6:24:36 PM UTC+2 uday kaipa wrote:
>>>
>>>> Hi,
>>>>
>>>> I have an image with the number 96 in it (it might contain any number
>>>> between 0 and 100). Please find it attached.
>>>> I have used tesseract PSM from 6 to 13, and image size and font and
>>>> everything looks good to me. The text is recognized as 36.
>>>> When I try to adjust padding or other pre-processing, it works for
>>>> this image, but some other images are then recognized incorrectly.
>>>>
>>>> Can anyone recommend any other pre-processing that might improve the
>>>> recognition?
>>>>
>>>> tesseract --oem 1 --psm 7 -c tessedit_char_whitelist=0123456789.:
>>>> C:/Users/xxx/Desktop/test_folder/IMG_2303_2cfac/subboxes/Image_BHU32_1_PREPROCESSED_27-06-2024_17h39m53s.JPG
>>>> new hocr
>>>>
>>>> Many thanks in advance.
>>>>
>>>> Regards
>>>> Uday



Re: [tesseract-ocr] Re: Text extraction failure after preprocessing.

2024-06-28 Thread Zdenko Podobny

First of all, using jpg as a format for image processing and OCR is not
very smart.

Next: it does not seem like a very standard font... maybe you will need
to train tesseract for it.
For me, it looks like a heavily preprocessed 7-segment font... so I tried
this:

tesseract 14.png - --psm 7 --oem 0 -l letsgodigital
14
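
For anyone who wants to script the command above, here is a minimal sketch using only Python's standard library. The function names `tesseract_cmd` and `run_ocr` are hypothetical, and it assumes that `tesseract` is on the PATH and that `letsgodigital.traineddata` is already in the tessdata directory; `run_ocr` is defined but not called here.

```python
import subprocess


def tesseract_cmd(image: str, lang: str, psm: int, oem: int) -> list[str]:
    """Build the argument list; '-' as output base sends the text to stdout."""
    return ["tesseract", image, "-", "--psm", str(psm), "--oem", str(oem), "-l", lang]


def run_ocr(image: str, lang: str = "letsgodigital", psm: int = 7, oem: int = 0) -> str:
    """Run Tesseract and return the recognized text (needs tesseract on PATH)."""
    result = subprocess.run(
        tesseract_cmd(image, lang, psm, oem),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


# Reproduces the command line shown above without running it.
print(" ".join(tesseract_cmd("14.png", "letsgodigital", 7, 0)))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.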

Zdenko


On Fri, 28 Jun 2024 at 14:09, 'uday kaipa' via tesseract-ocr
<tesseract-ocr@googlegroups.com> wrote:

> I have resized the image so that the text height is around 30 px, and I
> have tried a 10 px border as recommended in some threads here.
> I converted the image to binary and tried all PSM modes.
> I am not sure why it is not OCR'ed properly.
>
> Any help is appreciated. :)
>
>
>
>
>
> On Thursday, June 27, 2024 at 6:24:36 PM UTC+2 uday kaipa wrote:
>
>> Hi,
>>
>> I have an image with the number 96 in it (it might contain any number
>> between 0 and 100). Please find it attached.
>> I have used tesseract PSM from 6 to 13, and image size and font and
>> everything looks good to me. The text is recognized as 36.
>> When I try to adjust padding or other pre-processing, it works for
>> this image, but some other images are then recognized incorrectly.
>>
>> Can anyone recommend any other pre-processing that might improve the
>> recognition?
>>
>> tesseract --oem 1 --psm 7 -c tessedit_char_whitelist=0123456789.:
>> C:/Users/xxx/Desktop/test_folder/IMG_2303_2cfac/subboxes/Image_BHU32_1_PREPROCESSED_27-06-2024_17h39m53s.JPG
>> new hocr
>>
>> Many thanks in advance.
>>
>> Regards
>> Uday
>>
>>


Re: [tesseract-ocr] Error when trying to build Tesseract DLL from Scratch on Arch Linux via Cmake

2024-06-21 Thread Zdenko Podobny

Cross-compiling is tricky: you need to know what you are doing and how to
solve problems.
A better solution is to use https://github.com/UB-Mannheim/tesseract/wiki
AFAIK `cmake ..` configures the package for the current system (i.e. it does
not cross-compile).

Zdenko


On Thu, 20 Jun 2024 at 22:32, Danny wrote:

> Hey,
> sorry for any confusion this post may have caused.
> Yes, I'm trying to cross-compile Tesseract on Arch Linux for Windows.
> I've installed leptonica via the Arch package manager.
> Leptonica Version 1.84.1 is installed on my system.
> I've also installed webp via the package manager (libwebp on arch)
> I've done the following steps to run into this problem:
> 1. clone the tesseract repository via git clone
> 2. create a build folder inside the tesseract folder and navigate into it
> (mkdir build -> cd build)
> 3. run cmake ..
>
> This is the full output of the cmake ..  on my system:
>
> cmake ..
> -- Configuring tesseract version 5.4.1...
> -- IPO / LTO supported
> -- CMAKE_SYSTEM_PROCESSOR=
> -- Found leptonica version: 1.84.1
> CMake Error at
> /home/tesseract/build/CMakeFiles/CMakeScratch/TryCompile-RwJF5R/cmTC_5cec5Targets.cmake:21
> (set_target_properties):
>   The link interface of target "leptonica" contains:
>
> WebP::webp
>
>   but the target was not found.  Possible reasons include:
>
> * There is a typo in the target name.
> * A find_package call is missing for an IMPORTED target.
> * An ALIAS target is missing.
>
> Call Stack (most recent call first):
>
> /home/tesseract/build/CMakeFiles/CMakeScratch/TryCompile-RwJF5R/CMakeLists.txt:16
> (include)
>
>
> CMake Error at cmake/CheckFunctions.cmake:34 (try_run):
>   Failed to generate test project build system.
> Call Stack (most recent call first):
>   CMakeLists.txt:409 (check_leptonica_tiff_support)
>
> I hope this information helps.
> On Thursday, June 20, 2024 at 7:47:43 PM UTC+2 zdenop wrote:
>
>> I am lost in your post...
>> e.g. a DLL is built on Windows, not on Linux. Are you trying to
>> cross-compile Tesseract?
>> How did you install Leptonica? Which version?
>> What does it mean that "webp is installed on my system" - only runtime?
>>
>> Can you please provide each step and its output so we can replicate the
>> problem?
>>
>> Zdenko
>>
>>
>> On Thu, 20 Jun 2024 at 17:47, Danny wrote:
>>
>>> Hello,
>>> I'm currently trying to build a tesseract DLL from Scratch on Arch Linux
>>> with cmake.
>>> I've created a Build Folder inside the tesseract folder and executed
>>> cmake ..
>>> But when I execute cmake .. I get the following error:
>>> CMake Error at
>>> /home/tesseract/build/CMakeFiles/CMakeScratch/TryCompile-LokfWE/cmTC_0b6e5Targets.cmake:21
>>> (set_target_properties):
>>>   The link interface of target "leptonica" contains:
>>>
>>> WebP::webp
>>>
>>>   but the target was not found.  Possible reasons include:
>>>
>>> * There is a typo in the target name.
>>> * A find_package call is missing for an IMPORTED target.
>>> * An ALIAS target is missing.
>>>
>>> Call Stack (most recent call first):
>>>
>>> /home/tesseract/build/CMakeFiles/CMakeScratch/TryCompile-LokfWE/CMakeLists.txt:16
>>> (include)
>>>
>>>
>>> CMake Error at cmake/CheckFunctions.cmake:34 (try_run):
>>>   Failed to generate test project build system.
>>> Call Stack (most recent call first):
>>>   CMakeLists.txt:409 (check_leptonica_tiff_support)
>>>
>>> I have to add that webp is installed on my system.
>>> Any idea as to why this is happening?
>>> Any help would be much appreciated.
>>>
>>> Kind regards.
>>>


Re: [tesseract-ocr] Error when trying to build Tesseract DLL from Scratch on Arch Linux via Cmake

2024-06-20 Thread Zdenko Podobny

I am lost in your post...
e.g. a DLL is built on Windows, not on Linux. Are you trying to
cross-compile Tesseract?
How did you install Leptonica? Which version?
What does it mean that "webp is installed on my system" - only runtime?

Can you please provide each step and its output so we can replicate the
problem?

Zdenko


On Thu, 20 Jun 2024 at 17:47, Danny wrote:

> Hello,
> I'm currently trying to build a tesseract DLL from Scratch on Arch Linux
> with cmake.
> I've created a Build Folder inside the tesseract folder and executed cmake
> ..
> But when I execute cmake .. I get the following error:
> CMake Error at
> /home/tesseract/build/CMakeFiles/CMakeScratch/TryCompile-LokfWE/cmTC_0b6e5Targets.cmake:21
> (set_target_properties):
>   The link interface of target "leptonica" contains:
>
> WebP::webp
>
>   but the target was not found.  Possible reasons include:
>
> * There is a typo in the target name.
> * A find_package call is missing for an IMPORTED target.
> * An ALIAS target is missing.
>
> Call Stack (most recent call first):
>
> /home/tesseract/build/CMakeFiles/CMakeScratch/TryCompile-LokfWE/CMakeLists.txt:16
> (include)
>
>
> CMake Error at cmake/CheckFunctions.cmake:34 (try_run):
>   Failed to generate test project build system.
> Call Stack (most recent call first):
>   CMakeLists.txt:409 (check_leptonica_tiff_support)
>
> I have to add that webp is installed on my system.
> Any idea as to why this is happening?
> Any help would be much appreciated.
>
> Kind regards.
>


Re: [tesseract-ocr] Problem using "--oem 0" in Tesseract 5.4.0

2024-06-07 Thread Zdenko Podobny

Please show minimal respect and first google for a solution.


Zdenko


On Fri, 7 Jun 2024 at 18:23, Fred Andrews wrote:

> I captured a screenshot of a VirtualBox guest boot crash and Tesseract
> didn't seem to do very well OCRing that text, so I wanted to try the older
> engine, which the help says should be possible by using "--oem 0".
> However, this doesn't work:
>
> D:\temp\virtualbox-project>"c:\Program Files\Tesseract-OCR\tesseract.exe"
> vb-crash.png output --oem 0
> Error: Tesseract (legacy) engine requested, but components are not present
> in c:\Program Files\Tesseract-OCR/tessdata/eng.traineddata!!
> Failed loading language 'eng'
> Tesseract couldn't load any languages!
> Could not initialize tesseract.
>
> But, I installed Tesseract 5.4.0 using the prebuilt binary:
>
> https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-w64-setup-5.4.0.20240606.exe
> and so that file IS present at the location claimed:
>
> c:\Program Files\Tesseract-OCR\tessdata>dir
>  Volume in drive C is DESKOS
>  Volume Serial Number is EA89-635E
>
>  Directory of c:\Program Files\Tesseract-OCR\tessdata
>
> 06/07/2024  10:59 AM    <DIR>          .
> 06/07/2024  10:59 AM    <DIR>          ..
> 06/07/2024  04:50 AM    <DIR>          configs
> 06/06/2024  09:18 AM         4,113,088 eng.traineddata
> 01/16/2019  03:53 PM                33 eng.user-patterns
> 01/16/2019  03:53 PM                27 eng.user-words
> 06/06/2024  09:19 AM           128,076 jaxb-api-2.3.1.jar
> 06/06/2024  09:18 AM        10,562,727 osd.traineddata
> 06/06/2024  09:41 AM               572 pdf.ttf
> 06/06/2024  09:19 AM           125,187 piccolo2d-core-3.0.1.jar
> 06/06/2024  09:19 AM           149,558 piccolo2d-extras-3.0.1.jar
> 06/07/2024  04:50 AM    <DIR>          script
> 06/06/2024  09:19 AM            26,376 ScrollView.jar
> 06/07/2024  04:50 AM    <DIR>          tessconfigs
>                9 File(s)     15,105,644 bytes
>                5 Dir(s)  1,600,415,711,232 bytes free
>
> So it looks like either paths aren't being handled properly on Windows
> (note the use of forward slashes in the output), or somehow the old engine
> expects a different format than the eng.traineddata installed with 5.4.0
>
> Should I attempt to file an issue on the Mannheim Github site?
> https://github.com/UB-Mannheim/tesseract
>
>


Re: [tesseract-ocr] Error when running "make training" command

2024-05-29 Thread Zdenko Podobny

So:

   1. If you have a problem - use example data (ocrd-testset.zip) or
   provide your data set for reproducing the problem.
   2. Make sure you use the latest version of tesstrain.
   3. 'make training' does not produce the output you presented. Provide
   the real steps for reproducing the problem if you are interested in help.
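
The log quoted below ends with `Failed to load list of training filenames from data/korletter/list.train`, which suggests that no training samples were generated. A quick sanity check before running `make training` is sketched here. It assumes the tesstrain layout of line images (`.tif` or `.png`) with a `.gt.txt` transcription of the same base name next to each one; the function names and the example directory `data/korletter-ground-truth` are assumptions for illustration.

```python
from pathlib import Path


def images_without_transcription(ground_truth_dir: str) -> list[str]:
    """Return line images that have no matching <name>.gt.txt next to them."""
    missing = []
    for image in sorted(Path(ground_truth_dir).iterdir()):
        if image.suffix.lower() not in {".tif", ".png"}:
            continue
        transcription = image.with_name(image.stem + ".gt.txt")
        if not transcription.is_file():
            missing.append(image.name)
    return missing


def is_empty_list_file(list_file: str) -> bool:
    """True if a list file such as list.train is absent or has no entries."""
    path = Path(list_file)
    return not path.is_file() or not path.read_text().strip()


# Assumed usage (paths are examples, not run here):
#   print(images_without_transcription("data/korletter-ground-truth"))
#   print(is_empty_list_file("data/korletter/list.train"))
```

An empty ground-truth directory or an empty `list.train` would be worth reporting together with the exact steps taken.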


Zdenko


On Wed, 29 May 2024 at 15:45, Duy Hoàng wrote:

> I'm creating a training file on Windows based on the instructions here:
> https://github.com/tesseract-ocr/tesstrain/
>
> I'm using Tesseract OCR version 5.3.4.
> Can someone help me with this case?
>
> $ make training
> You are using make version: 4.4.1
> unicharset_extractor --output_unicharset "data/korletter/unicharset"
> --norm_mode 2 "data/korletter/all-gt"
> Extracting unicharset from plain text file data/korletter/all-gt
> Wrote unicharset file data/korletter/unicharset
> python shuffle.py 0 "data/korletter/all-lstmf"
> python generate_eval_train.py data/korletter/all-lstmf 0.90
> dos2unix "data/korletter/korletter.numbers"
> dos2unix: data/korletter/korletter.numbers: No such file or directory
> dos2unix: Skipping data/korletter/korletter.numbers, not a regular file.
> make: [Makefile:290: data/korletter/korletter.traineddata] Error 2
> (ignored)
> dos2unix "data/korletter/korletter.punc"
> dos2unix: data/korletter/korletter.punc: No such file or directory
> dos2unix: Skipping data/korletter/korletter.punc, not a regular file.
> make: [Makefile:291: data/korletter/korletter.traineddata] Error 2
> (ignored)
> dos2unix "data/korletter/korletter.wordlist"
> dos2unix: data/korletter/korletter.wordlist: No such file or directory
> dos2unix: Skipping data/korletter/korletter.wordlist, not a regular file.
> make: [Makefile:292: data/korletter/korletter.traineddata] Error 2
> (ignored)
> dos2unix "data/langdata/korletter/korletter.config"
> dos2unix: data/langdata/korletter/korletter.config: No such file or
> directory
> dos2unix: Skipping data/langdata/korletter/korletter.config, not a regular
> file.
> make: [Makefile:293: data/korletter/korletter.traineddata] Error 2
> (ignored)
> combine_lang_model \
>   --input_unicharset data/korletter/unicharset \
>   --script_dir data/langdata \
>   --numbers data/korletter/korletter.numbers \
>   --puncs data/korletter/korletter.punc \
>   --words data/korletter/korletter.wordlist \
>   --output_dir data \
>\
>   --lang korletter
> Failed to read data from: data/korletter/korletter.wordlist
> Failed to read data from: data/korletter/korletter.punc
> Failed to read data from: data/korletter/korletter.numbers
> Loaded unicharset of size 4 from file data/korletter/unicharset
> Setting unichar properties
> Setting script properties
> Config file is optional, continuing...
> Failed to read data from: data/langdata/korletter/korletter.config
> Null char=2
> Created data/korletter/korletter.traineddata
> lstmtraining \
>   --debug_interval 0 \
>   --traineddata data/korletter/korletter.traineddata \
>   --learning_rate 0.002 \
>   --net_spec "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx192 O1c4]" \
>   --model_output data/korletter/checkpoints/korletter \
>   --train_listfile data/korletter/list.train \
>   --eval_listfile data/korletter/list.eval \
>   --max_iterations 1 \
>   --target_error_rate 0.01 \
> 2>&1 | tee -a data/korletter/training.log
> Failed to load list of training filenames from data/korletter/list.train
>
> lstmtraining \
> --stop_training \
> --continue_from data/korletter/checkpoints/korletter_checkpoint \
> --traineddata data/korletter/korletter.traineddata \
> --model_output data/korletter.traineddata
> Failed to read continue from:
> data/korletter/checkpoints/korletter_checkpoint
> make: *** [Makefile:347: data/korletter.traineddata] Error 1
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/397d129c-0e61-4003-9cb4-c6b7f8a615a8n%40googlegroups.com
> 
> .
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAJbzG8yeCwyuVbQ2wyx8FJz34c%3DNM%3Ds-OCK6v6udORs4K_N0zQ%40mail.gmail.com.

Re: [tesseract-ocr] Openmp cannot be disabled

2024-05-25 Thread Zdenko Podobny

Well, if you want help, I suggest making a reproducible case that
demonstrates the problem.
Based on the description you provided, nobody can help you (not even
charlesw).
The problem could be somewhere in your code, in C#, in tesseract, or
even in your environment/OS...
You observed the problem => you need to narrow down where its source is.


Zdenko


On Sat, 25 May 2024 at 12:28, Kassim Papa wrote:

> I do not claim anything.
>
> Thank you for your suggestion. We will test that and post on the wrapper's
> GitHub (charlesw, the guy who made the wrapper).
>
> Stefan Weil closed the issue on the Tesseract GitHub, saying:
>
> "The Tesseract for Windows which is provided by UB Mannheim does not have
> this issue: it runs always single-threaded because it was built with OpenMP
> disabled. You did not say what Tesseract binary and which version you used."
>
> So I guess you must be right, this is where our effort should go.
>
> But I didn't even know that, and I don't understand all those OpenMP changes
> that have been made. Could you explain them to me? Since the issue is
> closed, I cannot talk to Stefan Weil anymore.
>
>
> On Saturday, 25 May 2024 at 12:12:21 UTC+2, zdenop wrote:
>
>> You need to replicate it with the tesseract executable if you want to
>> claim it is a Tesseract problem.
>>
>> Zdenko
>>
>>
>> On Sat, 25 May 2024 at 12:05, Kassim Papa wrote:
>>
>>> We use a C# wrapper,
>>> this .NET library found on NuGet:
>>> https://github.com/charlesw/tesseract
>>>
>>>
>>>
>>> On Saturday, 25 May 2024 at 11:14:36 UTC+2, zdenop wrote:
>>>
>>>> How did you install tesseract?
>>>>
>>>> What is the output of `tesseract -v`?
>>>>
>>>>
>>>> Zdenko
>>>>
>>>>
>>>> On Sat, 25 May 2024 at 11:03, Kassim Papa wrote:
>>>>
>>>>> Current Behavior:
>>>>>
>>>>> Despite setting OMP_THREAD_LIMIT=1, tesseract still uses all cores on
>>>>> my machine (i7, 7th gen, Windows 10).
>>>>>
>>>>> We used this (at the beginning of the code):
>>>>> Environment.SetEnvironmentVariable("OMP_THREAD_LIMIT", "1");
>>>>>
>>>>> And this (a batch file):
>>>>> @echo off
>>>>> set OMP_THREAD_LIMIT=1
>>>>> start "" "path_to_your_application.exe"
>>>>>
>>>>> We have 1 big image. Tesseract takes 7 seconds when we process it in
>>>>> one go.
>>>>>
>>>>> When we divide the image in 4 and run 4 instances of tesseract in
>>>>> parallel, it takes 7 seconds too: no change at all.
>>>>>
>>>>> Expected Behavior:
>>>>>
>>>>> We should see in the task manager that tesseract only uses 1 core.
>>>>>
>>>>> There should be a significant improvement when running 4 images in
>>>>> parallel. Multiple people had success with this method.
>

Re: [tesseract-ocr] Openmp cannot be disabled

2024-05-25 Thread Zdenko Podobny

You need to replicate it with the tesseract executable if you want to claim
it is a Tesseract problem.
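
One way to replicate this with the tesseract executable, sketched in Python under some assumptions: `tesseract` is on the PATH, the tile file names are hypothetical, and `OMP_THREAD_LIMIT` is passed through each child process's environment. `build_env`, `build_command` and `run_tiles` are illustrative helper names, not part of Tesseract.

```python
import os
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor


def build_env(thread_limit=1):
    """Copy the current environment and cap OpenMP threads for child processes."""
    env = os.environ.copy()
    env["OMP_THREAD_LIMIT"] = str(thread_limit)
    return env


def build_command(image_path, exe="tesseract"):
    """Command line that OCRs one image and prints the text to stdout."""
    return [exe, image_path, "stdout"]


def run_tiles(tile_paths, thread_limit=1):
    """Run one tesseract process per tile in parallel; return (texts, seconds)."""
    env = build_env(thread_limit)

    def ocr(path):
        result = subprocess.run(build_command(path), env=env,
                                capture_output=True, text=True, check=True)
        return result.stdout

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(tile_paths)) as pool:
        texts = list(pool.map(ocr, tile_paths))
    return texts, time.perf_counter() - start
```

Calling `run_tiles(["tile1.png", "tile2.png", "tile3.png", "tile4.png"])` (hypothetical file names) and comparing the elapsed time with a single run over the full image shows whether the limit has any effect outside the C# wrapper. Builds compiled with OpenMP report a `Found OpenMP` line in `tesseract -v`.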

Zdenko


On Sat, 25 May 2024 at 12:05, Kassim Papa wrote:

> We use a C# wrapper,
> this .NET library found on NuGet: https://github.com/charlesw/tesseract
>
>
>
> On Saturday, 25 May 2024 at 11:14:36 UTC+2, zdenop wrote:
>
>> How did you install tesseract?
>>
>> What is the output of `tesseract -v`?
>>
>>
>> Zdenko
>>
>>
>> On Sat, 25 May 2024 at 11:03, Kassim Papa wrote:
>>
>>> Current Behavior:
>>>
>>> Despite setting OMP_THREAD_LIMIT=1, tesseract still uses all cores on my
>>> machine (i7, 7th gen, Windows 10).
>>>
>>> We used this (at the beginning of the code):
>>> Environment.SetEnvironmentVariable("OMP_THREAD_LIMIT", "1");
>>>
>>> And this (a batch file):
>>> @echo off
>>> set OMP_THREAD_LIMIT=1
>>> start "" "path_to_your_application.exe"
>>>
>>> We have 1 big image. Tesseract takes 7 seconds when we process it in one
>>> go.
>>>
>>> When we divide the image in 4 and run 4 instances of tesseract in
>>> parallel, it takes 7 seconds too: no change at all.
>>>
>>> Expected Behavior:
>>>
>>> We should see in the task manager that tesseract only uses 1 core.
>>>
>>> There should be a significant improvement when running 4 images in
>>> parallel. Multiple people had success with this method.
>>>

Re: [tesseract-ocr] Openmp cannot be disabled

2024-05-25 Thread Zdenko Podobny

How did you install tesseract?

What is the output of `tesseract -v`?


Zdenko


On Sat, 25 May 2024 at 11:03, Kassim Papa wrote:

> Current Behavior:
>
> Despite setting OMP_THREAD_LIMIT=1, tesseract still uses all cores on my
> machine (i7, 7th gen, Windows 10).
>
> We used this (at the beginning of the code):
> Environment.SetEnvironmentVariable("OMP_THREAD_LIMIT", "1");
>
> And this (a batch file):
> @echo off
> set OMP_THREAD_LIMIT=1
> start "" "path_to_your_application.exe"
>
> We have 1 big image. Tesseract takes 7 seconds when we process it in one go.
>
> When we divide the image in 4 and run 4 instances of tesseract in parallel,
> it takes 7 seconds too: no change at all.
>
> Expected Behavior:
>
> We should see in the task manager that tesseract only uses 1 core.
>
> There should be a significant improvement when running 4 images in
> parallel. Multiple people had success with this method.
>

Re: [tesseract-ocr] What is arabic language code ?

2024-05-16 Thread Zdenko Podobny

What does not work?
What did you do?
How can we replicate it?
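
For reference, a language name passed with `-l` has to match an installed traineddata file; for Arabic that file is `ara.traineddata`, so the name is `ara`. A minimal sketch for checking what is installed, assuming `tesseract` is on the PATH (`parse_list_langs` and `installed_languages` are illustrative helper names):

```python
import subprocess


def parse_list_langs(output):
    """Return the language names from `tesseract --list-langs` output.

    The first line is a header ("List of available languages ..."), and
    each following non-empty line is one installed language name.
    """
    lines = [line.strip() for line in output.splitlines()]
    return [line for line in lines[1:] if line]


def installed_languages(exe="tesseract"):
    """Ask the tesseract executable which traineddata files it can see."""
    result = subprocess.run([exe, "--list-langs"], capture_output=True, text=True)
    # Fall back to stderr in case this build prints the list there.
    return parse_list_langs(result.stdout or result.stderr)
```

If `"ara" in installed_languages()` is false, the Arabic traineddata file is probably missing from the tessdata directory that this executable uses.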

Zdenko


On Fri, 17 May 2024 at 6:41, Zaid Vss wrote:

> ar - ara both not working
>

Re: [tesseract-ocr] How do you set parameter in tesseract 5? bug

2024-05-08 Thread Zdenko Podobny

> It doesn't seem to work on our end.

What does that mean? How can we replicate it?

> We're using the C# version and want to set parameters with a config file,
> but that doesn't do anything.

Have you tried the CLI, to rule out possible issues with the Tesseract C#
wrapper?
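
A minimal sketch of what such a CLI test could look like, driven from Python. It assumes the usual command-line form `tesseract imagename outputbase [options...] [configfile...]`, where `-c name=value` sets one parameter and a config file holds one `name value` pair per line. The helper names and the file names in the usage note are hypothetical.

```python
def build_tesseract_command(image, output_base, variables=None,
                            config_files=(), exe="tesseract"):
    """Build a tesseract command line that sets parameters explicitly."""
    cmd = [exe, image, output_base]
    for name, value in (variables or {}).items():
        # Each parameter gets its own `-c name=value` pair.
        cmd += ["-c", f"{name}={value}"]
    # Config file names (or paths) go last on the command line.
    cmd += list(config_files)
    return cmd


def format_config(variables):
    """Render parameters in config-file form: one `name value` pair per line."""
    return "".join(f"{name} {value}\n" for name, value in variables.items())
```

For example, `subprocess.run(build_tesseract_command("page.png", "out", {"tessedit_char_whitelist": "0123456789"}))` runs the same setting that a config file containing `format_config({"tessedit_char_whitelist": "0123456789"})` would apply. If the CLI honors the parameter and the wrapper does not, the issue is on the wrapper side.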

Zdenko


On Wed, 8 May 2024 at 7:26, Kassim Papa wrote:

> It doesn't seem to work on our end.
>
> We're using the C# version and want to set parameters with a config file,
> but that doesn't do anything.
>
> We found the parameters on this page:
> https://muthu.co/all-tesseract-ocr-options/
>
> We wanted to "play" with them to see what makes our use case work better.
>

Re: [tesseract-ocr] Cannt extract number from image screenshot

2024-04-29 Thread Zdenko Podobny

First, show us that you tried everything from the documentation.

Zdenko


On Mon, 29 Apr 2024 at 21:16, Master - Event wrote:

> I am trying to extract a number from a screenshot, but I can't. Can someone
> tell me why? My code:
> ```
> import base64
> from io import BytesIO
>
> import cv2
> import numpy as np
> import pytesseract
> from PIL import Image
>
> # `screenshot` is a base64-encoded image obtained elsewhere.
> img_data = base64.b64decode(screenshot)
> image_pil = Image.open(BytesIO(img_data))
> # Crop the region that contains the number: (left, upper, right, lower).
> box = (0, 222, 80, 240)
> cropped_img = image_pil.crop(box)
> image_cv2 = np.array(cropped_img)
> image_cv2 = cv2.cvtColor(image_cv2, cv2.COLOR_RGB2BGR)
> gray_image = cv2.cvtColor(image_cv2, cv2.COLOR_BGR2GRAY)
> # Upscale the 18-pixel-high crop by a factor of 3.
> resized_image = cv2.resize(gray_image, None, fx=3, fy=3, interpolation=cv2.INTER_CUBIC)
> cv2.imshow('__', resized_image)
> cv2.waitKey(0)
> pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
> extracted_text = pytesseract.image_to_string(resized_image, config="-c tessedit_char_whitelist=0123456789,./")
> ```
> [image: Screenshot 2024-04-29 202702.png]
>

Re: [tesseract-ocr] Beginner question : could not initialize tesseract, missing eng.traineddata file in tessdata

2024-04-25 Thread Zdenko Podobny

If you used tesstrain, you trained the LSTM engine. Why do you then ask
tesseract to use the legacy engine?
Do you understand what you are doing?
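
A minimal sketch of a pytesseract `config` string that requests the LSTM engine only (`--oem 1`) and passes the blacklist with `-c`. The `read_params_file: Can't open tessedit_char_blacklist=,;:` part of the quoted error suggests the blacklist was read as a config file name, possibly because `-c` was missing; that is an assumption, since the original call is not shown. `build_config` is an illustrative helper name.

```python
def build_config(tessdata_dir, oem=1, psm=None, variables=None):
    """Build a pytesseract `config` string for an LSTM-only traineddata file."""
    # --oem 1 selects the neural-net (LSTM) engine only.
    parts = [f'--tessdata-dir "{tessdata_dir}"', f"--oem {oem}"]
    if psm is not None:
        parts.append(f"--psm {psm}")
    for name, value in (variables or {}).items():
        # Without `-c`, tesseract treats the argument as a config file name.
        parts.append(f"-c {name}={value}")
    return " ".join(parts)
```

It could then be used as `pytesseract.image_to_string(image, lang="eng_pcb", config=build_config("external/tesstrain/data/eng_pcb", variables={"tessedit_char_blacklist": ",;:"}))`, with the directory taken from the quoted error message.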

Zdenko


On Thu, 25 Apr 2024 at 11:35, Surya VaraPrasad Alla wrote:

> eng_pcb.traineddata is a traineddata file trained starting from
> eng.traineddata.
>
> I did LSTM training to improve the detection of OCR rather than the
> recognition. I used the tesstrain git repo.
>
> Final error: couldn't find the legacy components in eng_pcb.traineddata
>
> On Monday, April 22, 2024 at 6:43:54 PM UTC+2 zdenop wrote:
>
>> No, you are not using best float tessdata files from:
>> https://github.com/tesseract-ocr/tessdata_best/blob/main/eng.traineddata
>> There is nothing like eng_pcb.traineddata. (read your error message)
>>
>>
>> Zdenko
>>
>>
>> On Mon, 22 Apr 2024 at 17:40, Surya VaraPrasad Alla wrote:
>>
>>> Hello,
>>>
>>> I have the similar response
>>>
>>> pytesseract.pytesseract.TesseractError: (1, "read_params_file: Can't
>>> open tessedit_char_blacklist=,;: Error: Tesseract (legacy) engine
>>> requested, but components are not present in
>>> external/tesstrain/data/eng_pcb/eng_pcb.traineddata!! Failed loading
>>> language 'eng_pcb' Tesseract couldn't load any languages! Could not
>>> initialize tesseract.")
>>>
>>> tesseract --version:
>>> tesseract -v
>>> tesseract 4.1.1
>>>  leptonica-1.82.0
>>>   libgif 5.1.9 : libjpeg 8d (libjpeg-turbo 2.1.1) : libpng 1.6.37 :
>>> libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.2 : libopenjp2 2.4.0
>>>  Found AVX512BW
>>>  Found AVX512F
>>>  Found AVX2
>>>  Found AVX
>>>  Found FMA
>>>  Found SSE
>>>  Found libarchive 3.6.0 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8
>>> liblz4/1.9.3 libzstd/1.4.8
>>>
>>> I am using best float tessdata files from:
>>> https://github.com/tesseract-ocr/tessdata_best/blob/main/eng.traineddata
>>>
>>> also tried some of possibilities in
>>> https://github.com/ocrmypdf/OCRmyPDF/issues/209
>>>
>>> I am looking for the source of the issue ---> could someone help if
>>> understood the source. so I can work further.
>>> On Tuesday, January 19, 2021 at 5:30:46 PM UTC+1 Shree Devi Kumar wrote:
>>>
>>>> >*wget https://github.com/tesseract-ocr/tessdata/blob/master/eng.traineddata*
>>>>
>>>> That is not correct. You need to get the `raw` file.
>>>>
>>>> https://github.com/tesseract-ocr/tessdata_best/raw/master/eng.traineddata
>>>>
>>>> *wget https://github.com/tesseract-ocr/tessdata/blob/master/eng.traineddata*
>>>>
>>>>
>>>> On Tue, Jan 19, 2021 at 9:49 PM Roparzh Hemon wrote:

>
> I downloaded it as you suggested, and as the terminal output below
> shows, the file is now present at the correct place :
>
> $ file /home/mbalambala/tesseract/tessdata/eng.traineddata
> /home/mbalambala/tesseract/tessdata/eng.traineddata : HTML document,
> UTF-8 Unicode text, with very long lines
>
> $ echo $TESSDATA_PREFIX
> /home/mbalambala/tesseract/tessdata
>
> but the error message stays exactly the same :
>
> $ tesseract Downloads/p1.pdf p1
> Error opening data file
> /home/mbalambala/tesseract/tessdata/eng.traineddata
> Please make sure the TESSDATA_PREFIX environment variable is set to
> your "tessdata" directory.
> Failed loading language 'eng'
> Tesseract couldn't load any languages!
> Could not initialize tesseract.
>
>
> Whatever the real problem is, the error message is not detecting it.
>
> On Sunday, January 17, 2021 at 10:37:22 AM UTC+1 ... wrote:
>
>> Run the following command in order to get the eng.traineddata file
>> within the tessdata directory: *wget
>> https://github.com/tesseract-ocr/tessdata/blob/master/eng.traineddata
>> *
>>
>
>
>


 --

 
 भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


Re: [tesseract-ocr] Beginner question : could not initialize tesseract, missing eng.traineddata file in tessdata

2024-04-22 Thread Zdenko Podobny

No, you are not using best float tessdata files from:
https://github.com/tesseract-ocr/tessdata_best/blob/main/eng.traineddata
There is nothing like eng_pcb.traineddata. (read your error message)
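
For the older failure quoted further down in this thread (a `wget` of a GitHub `/blob/` URL that saved an HTML page as eng.traineddata), a small check can catch the bad download early. This is a sketch under the assumption that an HTML page starts with `<!DOCTYPE html` or `<html` after optional whitespace; the helper names are illustrative.

```python
def looks_like_html_bytes(head):
    """True if the first bytes of a file look like an HTML page."""
    head = head.lstrip().lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")


def looks_like_html(path, probe_bytes=512):
    """Check whether a downloaded 'traineddata' file is really an HTML page."""
    with open(path, "rb") as handle:
        return looks_like_html_bytes(handle.read(probe_bytes))


def raw_url(blob_url):
    """Turn a GitHub `/blob/` page URL into the matching `/raw/` download URL."""
    return blob_url.replace("/blob/", "/raw/", 1)
```

If `looks_like_html(".../tessdata/eng.traineddata")` returns true, the file needs to be downloaded again from the `/raw/` URL.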


Zdenko


On Mon, 22 Apr 2024 at 17:40, Surya VaraPrasad Alla wrote:

> Hello,
>
> I have the similar response
>
> pytesseract.pytesseract.TesseractError: (1, "read_params_file: Can't open
> tessedit_char_blacklist=,;: Error: Tesseract (legacy) engine requested, but
> components are not present in
> external/tesstrain/data/eng_pcb/eng_pcb.traineddata!! Failed loading
> language 'eng_pcb' Tesseract couldn't load any languages! Could not
> initialize tesseract.")
>
> tesseract --version:
> tesseract -v
> tesseract 4.1.1
>  leptonica-1.82.0
>   libgif 5.1.9 : libjpeg 8d (libjpeg-turbo 2.1.1) : libpng 1.6.37 :
> libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.2 : libopenjp2 2.4.0
>  Found AVX512BW
>  Found AVX512F
>  Found AVX2
>  Found AVX
>  Found FMA
>  Found SSE
>  Found libarchive 3.6.0 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8
> liblz4/1.9.3 libzstd/1.4.8
>
> I am using best float tessdata files from:
> https://github.com/tesseract-ocr/tessdata_best/blob/main/eng.traineddata
>
> also tried some of possibilities in
> https://github.com/ocrmypdf/OCRmyPDF/issues/209
>
> I am looking for the source of the issue ---> could someone help if
> understood the source. so I can work further.
> On Tuesday, January 19, 2021 at 5:30:46 PM UTC+1 Shree Devi Kumar wrote:
>
>> >*wget https://github.com/tesseract-ocr/tessdata/blob/master/eng.traineddata
>> *
>>
>> That is not correct. You need to get the `raw` file.
>>
>> https://github.com/tesseract-ocr/tessdata_best/raw/master/eng.traineddata
>>
>> *wget https://github.com/tesseract-ocr/tessdata/blob/master/eng.traineddata
>> *
>>
>>
>> On Tue, Jan 19, 2021 at 9:49 PM Roparzh Hemon 
>> wrote:
>>
>>>
>>> I downloaded it as you suggested, and as the terminal output below
>>> shows, the file is now present at the correct place :
>>>
>>> $ file /home/mbalambala/tesseract/tessdata/eng.traineddata
>>> /home/mbalambala/tesseract/tessdata/eng.traineddata : HTML document,
>>> UTF-8 Unicode text, with very long lines
>>>
>>> $ echo $TESSDATA_PREFIX
>>> /home/mbalambala/tesseract/tessdata
>>>
>>> but the error message stays exactly the same :
>>>
>>> $ tesseract Downloads/p1.pdf p1
>>> Error opening data file
>>> /home/mbalambala/tesseract/tessdata/eng.traineddata
>>> Please make sure the TESSDATA_PREFIX environment variable is set to your
>>> "tessdata" directory.
>>> Failed loading language 'eng'
>>> Tesseract couldn't load any languages!
>>> Could not initialize tesseract.
>>>
>>>
>>> Whatever the real problem is, the error message is not detecting it.
>>>
>>> On Sunday, January 17, 2021 at 10:37:22 AM UTC+1 ... wrote:
>>>
>>>> Run the following command in order to get the eng.traineddata file
>>>> within the tessdata directory: *wget
>>>> https://github.com/tesseract-ocr/tessdata/blob/master/eng.traineddata*

>>>
>>>
>>>
>>
>>
>> --
>>
>> 
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>

Re: [tesseract-ocr] tesseract misleading in 8 and 6

2024-04-18 Thread Zdenko Podobny

Unfortunately, your post is very vague. Unless you provide a detailed
description of what you are doing (step-by-step so we can replicate it),
nobody can help you.


Zdenko


On Wed, 17 Apr 2024 at 12:14, Jayrajsinh Zala wrote:

> I trained Tesseract OCR using MATLAB and use a specific traineddata file,
> but I am still getting errors on 8 and 6.
>
> I attach all the images that I used for training; I am getting errors for
> the same type of images.
>

Re: [tesseract-ocr] Getting Error: No such file or directory: 'data/foo/all-lstmf'

2024-03-27 Thread Zdenko Podobny

You can train on custom images: see the example ocrd-testset.zip, and
follow the example from
https://github.com/tesseract-ocr/tesstrain/blob/main/README.md :

unzip ocrd-testset.zip -d data/ocrd-ground-truth
make training MODEL_NAME=ocrd START_MODEL=deu_latf TESSDATA=~/tessdata_best MAX_ITERATIONS=1
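
Before running `make training` on your own images, it may help to check that every line image has a transcription. The sketch below assumes the naming convention from the tesstrain README: images end in `.tif`, `.png`, `.bin.png` or `.nrm.png`, and each transcription has the same name with the image extension replaced by `.gt.txt`. The helper names are illustrative.

```python
import os

# Longer suffixes first, so "x.bin.png" maps to "x.gt.txt" and not "x.bin.gt.txt".
IMAGE_SUFFIXES = (".bin.png", ".nrm.png", ".png", ".tif")


def transcription_name(image_name):
    """Return the expected .gt.txt name for a line image, or None if not an image."""
    for suffix in IMAGE_SUFFIXES:
        if image_name.endswith(suffix):
            return image_name[: -len(suffix)] + ".gt.txt"
    return None


def missing_transcriptions(file_names):
    """List the images in `file_names` that have no matching .gt.txt file."""
    names = set(file_names)
    missing = []
    for name in sorted(names):
        expected = transcription_name(name)
        if expected is not None and expected not in names:
            missing.append(name)
    return missing


def check_ground_truth_dir(directory):
    """Run the check on a ground-truth folder such as data/ocrd-ground-truth."""
    return missing_transcriptions(os.listdir(directory))
```

An empty list from `check_ground_truth_dir("data/ocrd-ground-truth")` means every image in that folder has a transcription next to it.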


Zdenko


On Sat, 28 Oct 2023 at 17:37, Dev Solution wrote:

> Can I train on my custom images? I'm going to build a scanner for French
> receipts, so I need to train on all of these to increase accuracy. What do
> you suggest, Zdenop?
>
> On Saturday, October 28, 2023 at 11:58:10 AM UTC+2 zdenop wrote:
>
>> It does not work on Windows (directly), but it works on Linux => use WSL
>> if you really need training.
>> Or wait until somebody finds a fix for Windows (or send the fix - this is
>> an open source project, so everybody should contribute ;-) )
>>
>> Zdenko
>>
>>
>> On Fri, 27 Oct 2023 at 17:32, Dev Solution wrote:
>>
>>>
>>> I just tried to run these all commands, but I got error
>>> https://prnt.sc/lLHeR27J2U65
>>>
>>> On Tuesday, June 6, 2023 at 10:03:17 AM UTC+2 zdenop wrote:
>>>
>>>> Do not create files manually.
>>>> If "make training" does not work, it means:
>>>>
>>>> 1. you are missing some dependency, or the input data is wrong
>>>> 2. you also missed the error message for 1.
>>>>
>>>> I strongly suggest you start the training from the beginning
>>>> (including cloning tesstrain) and pay attention to all messages:
>>>>
>>>> git clone --depth 1 https://github.com/tesseract-ocr/tesstrain.git
>>>> cd tesstrain
>>>> make tesseract-langdata
>>>> mkdir tessdata_best
>>>> wget https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata -P tessdata_best
>>>> unzip ocrd-testset.zip -d data/ocrd-ground-truth
>>>> make training MODEL_NAME=ocrd TESSDATA=tessdata_best MAX_ITERATIONS=1
>>>>
>>>>
>>>> Zdenko
>>>>
>>>>
>>>> On Mon, 5 Jun 2023 at 4:22, Madhav Pandey wrote:

> Hi Zdenop,
>
> Apologies. I got your name wrong in the thread.
>
> Can you please help me resolve this issue? The make training command
> was not creating the all-gt file, so I created it manually and put it in
> the MODEL_NAME directory.
>
> I created it by copying all the single lines from the text files into
> the all-gt file. I am not sure if this is the right approach. Please
> correct me if I am wrong here.
>
> After doing this, I am getting this error:
>
> python3 shuffle.py 0 "data/Apex/all-lstmf"
> Traceback (most recent call last):
>   File "/Users/madpande/Code/git/tesseract_tutorial/tesstrain/shuffle.py", line 24, in <module>
>     fd0 = open(sys.argv[2], 'r')
> FileNotFoundError: [Errno 2] No such file or directory: 'data/Apex/all-lstmf'
>
>
> I am pretty sure I am missing something here. Please help!
>
> Thanks!
>
> On Thursday, 1 June 2023 at 23:39:01 UTC-6 Madhav Pandey wrote:
>
>> Hi Zdenko,
>>
>> At what step in the make file the all-gt file is created? I am still
>> unable to move forward with the custom model training.
>>
>> Any help would be greatly appreciated. Thanks!
>>
>> On Wednesday, 26 April 2023 at 09:47:55 UTC-6 zdenop wrote:
>>
>>> make training TESSDATA=./usr/local/share/tessdata
>>> unicharset_extractor --output_unicharset "data/foo/unicharset"
>>> --norm_mode 2 "data/foo/all-gt"
>>>
>>> Failed to read data from: data/foo/all-gt
>>>
>>>
>>> This indicates you already run training that failed...
>>> Clean your training and start it once again. Pay attention to why
>>> "data/foo/all-gt" is not created (there will be an error message).
>>>
>>> Zdenko
>>>
>>>
>>> On Wed, 26 Apr 2023 at 2:07, Madhav Pandey wrote:
>>>
 @zdenop

 This is the entire training output:

 ```make training TESSDATA=./usr/local/share/tessdata
 unicharset_extractor --output_unicharset "data/foo/unicharset"
 --norm_mode 2 "data/foo/all-gt"
 Failed to read data from: data/foo/all-gt
 Wrote unicharset file data/foo/unicharset
 PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i
 "data/foo-ground-truth/alexis_ruhe01_1852_0087_027.tif" -t
 "data/foo-ground-truth/alexis_ruhe01_1852_0087_027.gt.txt" >
 "data/foo-ground-truth/alexis_ruhe01_1852_0087_027.box"
 set -x; \
 tesseract
 "data/foo-ground-truth/alexis_ruhe01_1852_0087_027.tif"
 data/foo-ground-truth/alexis_ruhe01_1852_0087_027 --psm 13 lstm.train
 + tesseract data/foo-ground-truth/alexis_ruhe01_1852_0087_027.tif
 data/foo-ground-truth/alexis_ruhe01_1852_0087_027 --psm 13 lstm.train
 PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i
 "data/foo-ground-truth/alexis_ruhe01_1852_0

Re: [tesseract-ocr] fine tuning on images

2024-03-27 Thread Zdenko Podobny

You can easily test your hypothesis by changing the Makefile[1] lines from
    tesseract "$<" $* --psm $(PSM) lstm.train
to
    tesseract "$<" $* --psm $(PSM) -l $(START_MODEL) lstm.train

[1]
https://github.com/tesseract-ocr/tesstrain/blob/19f79e2d38dfeada41a96c8d87426c85a7eaa454/Makefile#L242-L255
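
If you prefer not to edit the Makefile by hand, the same change can be sketched as a plain text substitution. This assumes the recipe line appears in your copy of the Makefile exactly as shown above; `add_start_model_flag` and `patch_makefile` are illustrative names, and it is worth keeping a backup of the file first.

```python
OLD_RECIPE = 'tesseract "$<" $* --psm $(PSM) lstm.train'
NEW_RECIPE = 'tesseract "$<" $* --psm $(PSM) -l $(START_MODEL) lstm.train'


def add_start_model_flag(makefile_text):
    """Return (patched_text, count) with `-l $(START_MODEL)` added to each recipe."""
    count = makefile_text.count(OLD_RECIPE)
    return makefile_text.replace(OLD_RECIPE, NEW_RECIPE), count


def patch_makefile(path="Makefile"):
    """Patch the Makefile in place and return how many lines were changed."""
    with open(path, encoding="utf-8") as handle:
        text = handle.read()
    patched, count = add_start_model_flag(text)
    if count:
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(patched)
    return count
```

A return value of 0 from `patch_makefile()` means the expected line was not found, for example because your tesstrain checkout differs from the linked revision or was already patched.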

Zdenko


On Thu, 14 Mar 2024 at 11:04, roei shlezinger wrote:

> Hello, I have relatively clear images in Hebrew and Tesseract produces
> reasonable but not perfect results. I thought about continuing to train the
> model to make them better but ran into a problem. Here is the command I run:
>
> "bash-4.4# make training MODEL_NAME=test11
> GROUND_TRUTH_DIR=/home/tesstrain/data/files START_MODEL=heb PSM=7 DPI=96
> DEBUG_INTERVAL=-1 MAX_ITERATIONS=100"
>
> While training I get the following results. Note that the percentage is
> over 100:
> "At iteration 10/10/10, Mean rms=11.396%, delta=111.114%, char
> train=146.702%, word train=100%, skip ratio=0%, New worst char error =
> 146.702 wrote checkpoint."
>
> I have a hypothesis as to why this happens: during the training process I
> get the output below. The important line in it is this:
> "PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i
> "/home/tesstrain/data/files/MR_1.1.tif" -t
> "/home/tesstrain/data/files/MR_1.1.gt.txt" > "
> /home/tesstrain/data/files/MR_1.1.box"
> + tesseract /home/tesstrain/data/files/MR_1.1.tif
> /home/tesstrain/data/files/MR_1.1 --psm 7 lstm.train"
> This gives me in the GROUND_TRUTH_DIR folder an additional file with lstmf
> extensions and an additional file with txt extension. The txt file is empty
> except for one up arrow character. It seems that during the training,
> tesseract is activated and it does not receive a Hebrew language parameter
> and therefore fails to recognize the text. I'm not sure that's the problem,
> but I'm sure the training failed. Does anyone have an idea what I'm doing
> wrong? I would appreciate any help, thanks Roy.
> Full output mode:
>
> bash-4.4# make training MODEL_NAME=test4
> GROUND_TRUTH_DIR=/home/tesstrain/data/files START_MODEL=heb PSM=7 DPI=96
> DEBUG_INTERVAL=-1 MAX_ITERATIONS=100
> find -L /home/tesstrain/data/files -name '*.gt.txt' | xargs paste -s >
> "data/test4/all-gt"
> combine_tessdata -u /home/tesstrain/usr/share/tessdata/heb.traineddata
>  data/heb/test4
> Extracting tessdata components from
> /home/tesstrain/usr/share/tessdata/heb.traineddata
> Wrote data/heb/test4.lstm
> Wrote data/heb/test4.lstm-punc-dawg
> Wrote data/heb/test4.lstm-word-dawg
> Wrote data/heb/test4.lstm-number-dawg
> Wrote data/heb/test4.lstm-unicharset
> Wrote data/heb/test4.lstm-recoder
> Wrote data/heb/test4.version
> Version
> string:4.00.00alpha:heb:synth20170629:[1,36,0,1Ct3,3,16Mp3,3Lfys48Lfx96Lrx96Lfx192O1c1]
> 17:lstm:size=3022651, offset=192
> 18:lstm-punc-dawg:size=1378, offset=3022843
> 19:lstm-word-dawg:size=673826, offset=3024221
> 20:lstm-number-dawg:size=1298, offset=3698047
> 21:lstm-unicharset:size=4023, offset=3699345
> 22:lstm-recoder:size=625, offset=3703368
> 23:version:size=80, offset=3703993
> unicharset_extractor --output_unicharset "data/test4/my.unicharset"
> --norm_mode 2 "data/test4/all-gt"
> Bad box coordinates in boxfile string! ויצעק משה אל יהוה על דבר הצפרדעים
> אשר
> Extracting unicharset from plain text file data/test4/all-gt
> Wrote unicharset file data/test4/my.unicharset
> merge_unicharsets data/heb/test4.lstm-unicharset data/test4/my.unicharset
>  "data/test4/unicharset"
> Loaded unicharset of size 69 from file data/heb/test4.lstm-unicharset
> Loaded unicharset of size 30 from file data/test4/my.unicharset
> Wrote unicharset file data/test4/unicharset.
> PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i
> "/home/tesstrain/data/files/MR_1.0.tif" -t
> "/home/tesstrain/data/files/MR_1.0.gt.txt" >
> "/home/tesstrain/data/files/MR_1.0.box"
> + tesseract /home/tesstrain/data/files/MR_1.0.tif
> /home/tesstrain/data/files/MR_1.0 --psm 7 lstm.train
> Tesseract Open Source OCR Engine v4.1.0 with Leptonica
> Page 1
> Warning: Invalid resolution 0 dpi. Using 70 instead.
> PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i
> "/home/tesstrain/data/files/MR_1.1.tif" -t
> "/home/tesstrain/data/files/MR_1.1.gt.txt" >
> "/home/tesstrain/data/files/MR_1.1.box"
> + tesseract /home/tesstrain/data/files/MR_1.1.tif
> /home/tesstrain/data/files/MR_1.1 --psm 7 lstm.train
> Tesseract Open Source OCR Engine v4.1.0 with Leptonica
> Page 1
> Warning: Invalid resolution 0 dpi. Using 70 instead.
> PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i
> "/home/tesstrain/data/files/MR_1.10.tif" -t
> "/home/tesstrain/data/files/MR_1.10.gt.txt" >
> "/home/tesstrain/data/files/MR_1.10.box"
> + tesseract /home/tesstrain/data/files/MR_1.10.tif
> /home/tesstrain/data/files/MR_1.10 --psm 7 lstm.train
> Tesseract Open Source OCR Engine v4.1.0 with Leptonica
> combine_lang_model \
>   --input_unicharset data/test14/unicharset \
>   --script_dir data \
>   --numbers data/test14/test14.numbers \
>   --puncs da

Re: [tesseract-ocr] Lack of accuracy on reading numbers

2024-03-27 Thread Zdenko Podobny

Always test the command line if there is an issue with the wrapper.

tesseract -v
tesseract 5.3.4-44-g2b07
 leptonica-1.84.0 (Dec 31 2023, 23:36:37) [MSC v.1929 LIB Release x64]
  libgif 5.1.2 : libjpeg 6b (libjpeg-turbo 2.1.90) : libpng 1.6.40 :
libtiff 4.6.0 : zlib 1.2.13.zlib-ng : libwebp 1.3.2 : libopenjp2 2.5.0
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found OpenMP 200203

tesseract teach_t2.png -
3

tesseract teach_t2.png - --psm 8
C 3111

tesseract teach_t2.png - --psm 7
3


It seems that psm 8 is not suitable in this case.
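
To repeat this comparison on many images, the same command lines can be scripted. This is only a minimal sketch: the helper names build_command and compare_psm are made up for illustration, and it assumes the tesseract executable is on PATH.

```python
import subprocess


def build_command(image, psm=None):
    """Return the tesseract command line that prints the OCR text to stdout."""
    cmd = ["tesseract", image, "-"]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    return cmd


def compare_psm(image, modes=(None, 7, 8)):
    """Run tesseract once per page segmentation mode and collect the text.

    None stands for "no --psm argument", i.e. the default mode.
    """
    results = {}
    for psm in modes:
        proc = subprocess.run(build_command(image, psm),
                              capture_output=True, text=True, check=False)
        results[psm] = proc.stdout.strip()
    return results
```

Calling compare_psm("teach_t2.png") would return a dict keyed by the psm value, which makes it easy to spot the mode that disagrees with the others.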

Zdenko


st 27. 3. 2024 o 7:42 Ajay Pandya  napísal(a):

> Hello Everyone,
>
> I am using Tesseract 5.2 with C# and have a problem reading this number.
>
> PSM : 8
> OEM : 3
> Train file : eng (Best)
>
> Data : 3, Reading 3111.
>
> We have many similar images with different numbers. Sometimes it adds an
> extra number and sometimes it removes one.
>
> Kindly help with this problem.
>
> Thanks.
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/3664ab29-85a5-49ad-9066-789293feaefdn%40googlegroups.com
> 
> .
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAJbzG8x8TUhbEF%2BjunrdP6ytO3A1xp0DnnOdnX-OAT-jBp-hLA%40mail.gmail.com.

Re: [tesseract-ocr] Reading large gray images with only numbers yields incorrect results

2024-03-26 Thread Zdenko Podobny

Yes, we have suggestions for you to improve the accuracy of the results:
they are already in the documentation. Just read it.

Zdenko


ut 26. 3. 2024 o 13:41 inKi Wang  napísal(a):

> Hi everyone, I wish you all a good day.
>
> I'm currently encountering an issue with image_to_string producing
> incorrect results when reading large gray images containing only numbers.
> Here's what I'm using:
>
>    - pytesseract version 0.3.10
>    - tesseractOCR version 5.3.3
>    - Language: English (eng)
>    - PSM: 7
>    - OEM: 3
>
> With the image provided below, the result returned when using the
> image_to_string function is 9.5. When I resize the image, it returns 9.0,
> 9.5, and sometimes *9.9*. There was an instance where resizing gave 5.5,
> but it was incorrect for other cases with different numbers.
>
> Do you have any suggestions for me to improve the accuracy of the results?
> Thank you all, and I wish you a great day!
>
> [image: 9.0.png]
>


Re: [tesseract-ocr] Does training new images increase the size of the traindata file?

2024-03-26 Thread Zdenko Podobny

Unless you provide information about what you do, and the possibility to
replicate your process (providing input data) we do not know what is wrong
with it.
Did you check the example for official training[1]?

In my case I see this:
The output has size 7485144 (`ls -l data/ocrd.traineddata`), while the start
model frk has size 12938047 (best model) and its fast model has size 6423052.

[1] https://github.com/tesseract-ocr/tesstrain
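
The size comparison above is easy to script when several models need checking. A minimal sketch follows; the function name size_report is made up for illustration, and the paths passed in are whatever traineddata files you have locally.

```python
import os


def size_report(paths, reference):
    """Return one line per file, comparing its size with the reference file."""
    ref_size = os.path.getsize(reference)
    lines = []
    for path in paths:
        size = os.path.getsize(path)
        lines.append(f"{path}: {size} bytes ({size / ref_size:.2f} x reference)")
    return lines
```

For example, size_report(["data/ocrd.traineddata"], reference) with the start model as reference prints the output size as a fraction of the start model size.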

Zdenko


st 13. 3. 2024 o 6:43 Cain Pian  napísal(a):

> I've trained thousands of images. But the traineddata file size didn't
> change at all.
> Did I do something wrong?
>


Re: [tesseract-ocr] Leptonica directory

2024-03-12 Thread Zdenko Podobny

It seems like you are not following the official documented way for
compiling leptonica and tesseract. Follow it. Then we can help you.


Zdenko


st 13. 3. 2024 o 6:43 Ravil R  napísal(a):

> Windows, msvc 2022, win32, I've got some questions regarding compilation
> 1) How to specify the directory where Leptonica is installed? No matter
> what I tried sln file every time contains *c:\Program
> Files(x86)\Leptonica*
> 2) Leptonica is definitely compiled with libtiff support:
> *-- Used TIFF library: C:/Soft/tesslibs/libtiff/Win32/lib/tiff.lib*
> but tesseract thinks it is not:
> *Leptonica was build without TIFF support...*
> 3) How to build both Leptonica and Tesseract as static libraries?
> BUILD_STATIC_LIBS doesn't work and cmake says it is not used
> 4) How to specify the ICU library files location?
>


Re: [tesseract-ocr] user patterns with tesserocr python API

2024-03-12 Thread Zdenko Podobny

One correction:

I checked the example at the URL mentioned below with the Tesseract
executable and the tessdata repository. The result is that user_patterns
also affects LSTM. This can easily be tested by generating output
without user_patterns (Arial.txt):

tesseract Arial.png Arial

And with patterns:
tesseract Arial.png Arial.pat --user-patterns my.patterns
tesseract Arial.png Arial.pat.oem0 --user-patterns my.patterns --oem 0
tesseract Arial.png Arial.pat.oem1 --user-patterns my.patterns --oem 1
tesseract Arial.png Arial.pat.oem2 --user-patterns my.patterns --oem 2
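
The runs above can also be scripted so the outputs are compared automatically. This is a sketch under assumptions: build_runs and run_and_compare are made-up helper names, tesseract is assumed to be on PATH, and each run is assumed to write its text to the output base plus ".txt". A run that fails (for example --oem 0 with a model that has no legacy component) simply yields None.

```python
import subprocess
from pathlib import Path


def build_runs(image, patterns, base="Arial"):
    """Return (output_base, command) pairs: one baseline run, then runs with patterns."""
    runs = [(base, ["tesseract", image, base])]
    with_patterns = ["--user-patterns", patterns]
    runs.append((f"{base}.pat",
                 ["tesseract", image, f"{base}.pat"] + with_patterns))
    for oem in (0, 1, 2):
        out = f"{base}.pat.oem{oem}"
        runs.append((out, ["tesseract", image, out] + with_patterns
                     + ["--oem", str(oem)]))
    return runs


def run_and_compare(image, patterns):
    """Run every command and report which outputs differ from the baseline text."""
    runs = build_runs(image, patterns)
    texts = {}
    for out, cmd in runs:
        subprocess.run(cmd, check=False)
        txt = Path(out + ".txt")
        texts[out] = txt.read_text(encoding="utf-8") if txt.exists() else None
    baseline = texts[runs[0][0]]
    return {out: text != baseline for out, text in texts.items()}
```

run_and_compare("Arial.png", "my.patterns") returns a dict that maps each output base to True when its text differs from the run without patterns.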

Zdenko


ne 10. 3. 2024 o 17:32 Zdenko Podobny  napísal(a):

> Maybe I am wrong, but it looks to me like you are expecting from
> user-patterns something it never promises to provide.
> What we know/experienced:
>
>    - user-patterns extends the Tesseract legacy engine dictionary.
>    - putting a word/pattern into the Tesseract legacy engine dictionary
>    never guarantees that the word is recognized correctly (see the remark at
>    https://tesseract-ocr.github.io/tessdoc/APIExample-user_patterns.html)
>    - somebody (I cannot find the details, as it was a long time ago) ran
>    tests and found that the Tesseract legacy engine dictionary has limited
>    effect. For "nonword" text (like "codes" with mixed letters and digits),
>    people usually turn off the dictionary.
>    - some users prefer to use the legacy engine for "codes" instead of
>    LSTM
>
> As far as I know, nobody has tested LSTM and dictionaries, e.g.
> investigated whether user-patterns also affects the LSTM engine (for LSTM
> there are new dictionary
> components lstm-punc-dawg, lstm-word-dawg, lstm-number-dawg) ...
>
>
> Zdenko
>
>
> ne 3. 3. 2024 o 23:02 Roman Seidel  napísal(a):
>
>> To be more precise with my questions:
>>
>> - Is the user-patterns functionality implemented in the tesserocr
>> Python API of tesseract?
>> - What exactly is the syntax for specifying user patterns with the tesserocr
>> Python API? Is SetVariable() correct, and how are the path (Linux) and the
>> attribute specified?
>> - Is there a default path where it looks for the *.patterns /
>> *.user-patterns file?
>>
>> With the attached code from my last message, I've tested different
>> combinations with/without the whitelist, different
>> attributes and path notations, none of which was successful.
>>
>> If I use the following notation for user patterns, it has no effect on
>> the results, regardless of the entries in the *.patterns file:
>>
>> api.SetVariable('user_patterns_file',
>> '/home/roman/Dev_d/playground/user_patterns/deu.patterns')
>>
>> Has anyone (successfully) used user patterns with the tesserocr
>> Python API of tesseract?
>>
>> best wishes and thanks, Roman
>>
>>
>> Am Sa., 2. März 2024 um 13:08 Uhr schrieb Zdenko Podobny <
>> zde...@gmail.com>:
>>
>>> Can you please elaborate on:
>>>
>>> Nevertheless, user patterns is not working in the way described above.
>>>
>>>
>>>
>>> Zdenko
>>>
>>>
>>> so 2. 3. 2024 o 10:45 Roman Seidel 
>>> napísal(a):
>>>
>>>> Yes, sure, the input file is a snippet with a capital letter followed
>>>> by 9 digits. The correct user pattern, corresponding to [1] is:
>>>>
>>>> ``\A\d\d\d\d\d\d\d\d\d``
>>>>
>>>> The result of Tesseract (psm 8) is fully correct. Nevertheless, user
>>>> patterns is not working in the way described above.
>>>>
>>>> For instance, I have tried to extract only the capital character with
>>>> user patterns (not with whitelist), which is:
>>>>
>>>> \A
>>>>
>>>> In this case, the capital letter and all digits are given back by
>>>> tesseract.
>>>>
>>>> I've attached my input file and the corresponding Python snippet for
>>>> reading and processing the image with tesserocr from [2]
>>>>
>>>>
>>>> [1]
>>>> https://github.com/tesseract-ocr/tesseract/blob/main/src/dict/trie.h#L197
>>>> [2] https://github.com/sirfz/tesserocr
>>>>
>>>>
>>>>
>>>> Am Fr., 1. März 2024 um 18:59 Uhr schrieb René JM Clais <
>>>> reneclai...@gmail.com>:
>>>>
>>>>> Can you send an example of an input document and the output of
>>>>> tesseract as well of what should be your expectation using the pattern
>>>>> file.
>>>

Re: [tesseract-ocr] user patterns with tesserocr python API

2024-03-10 Thread Zdenko Podobny

Maybe I am wrong, but it looks to me like you are expecting from
user-patterns something it never promises to provide.
What we know/experienced:

   - user-patterns extends the Tesseract legacy engine dictionary.
   - putting a word/pattern into the Tesseract legacy engine dictionary never
   guarantees that the word is recognized correctly (see the remark at
   https://tesseract-ocr.github.io/tessdoc/APIExample-user_patterns.html)
   - somebody (I cannot find the details, as it was a long time ago) ran tests
   and found that the Tesseract legacy engine dictionary has limited
   effect. For "nonword" text (like "codes" with mixed letters and digits),
   people usually turn off the dictionary.
   - some users prefer to use the legacy engine for "codes" instead of LSTM

As far as I know, nobody has tested LSTM and dictionaries, e.g.
investigated whether user-patterns also affects the LSTM engine (for LSTM
there are new dictionary
components lstm-punc-dawg, lstm-word-dawg, lstm-number-dawg) ...
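
Since a pattern never guarantees the recognized text, one option is to check the OCR result against the expected shape afterwards. The sketch below is not part of Tesseract or tesserocr: pattern_to_regex and matches are made-up helpers, and only the two classes used in this thread (\A for a capital letter, \d for a digit) are translated, approximated here as ASCII ranges.

```python
import re

# Only the classes discussed in this thread; the ASCII ranges are an approximation.
_CLASS_MAP = {"A": "[A-Z]", "d": "[0-9]"}


def pattern_to_regex(pattern):
    """Translate a user-patterns style line into a compiled Python regex."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            if i + 1 >= len(pattern):
                raise ValueError("dangling backslash in pattern")
            code = pattern[i + 1]
            if code not in _CLASS_MAP:
                raise ValueError(f"unsupported class in this sketch: \\{code}")
            parts.append(_CLASS_MAP[code])
            i += 2
        else:
            parts.append(re.escape(ch))  # literal character
            i += 1
    return re.compile("".join(parts))


def matches(pattern, text):
    """True when the whole OCR result (ignoring surrounding whitespace) fits the pattern."""
    return pattern_to_regex(pattern).fullmatch(text.strip()) is not None
```

With this, matches(r"\A\d\d\d\d\d\d\d\d\d", ocr_text) can be used to accept or reject a result, independent of which engine produced it.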


Zdenko


ne 3. 3. 2024 o 23:02 Roman Seidel  napísal(a):

> To be more precise with my questions:
>
> - Is the user-patterns functionality implemented in the tesserocr
> Python API of tesseract?
> - What exactly is the syntax for specifying user patterns with the tesserocr
> Python API? Is SetVariable() correct, and how are the path (Linux) and the
> attribute specified?
> - Is there a default path where it looks for the *.patterns /
> *.user-patterns file?
>
> With the attached code from my last message, I've tested different
> combinations with/without the whitelist, different
> attributes and path notations, none of which was successful.
>
> If I use the following notation for user patterns, it has no effect on the
> results, regardless of the entries in the *.patterns file:
>
> api.SetVariable('user_patterns_file',
> '/home/roman/Dev_d/playground/user_patterns/deu.patterns')
>
> Has anyone (successfully) used user patterns with the tesserocr
> Python API of tesseract?
>
> best wishes and thanks, Roman
>
>
> Am Sa., 2. März 2024 um 13:08 Uhr schrieb Zdenko Podobny  >:
>
>> Can you please elaborate on:
>>
>> Nevertheless, user patterns is not working in the way described above.
>>
>>
>>
>> Zdenko
>>
>>
>> so 2. 3. 2024 o 10:45 Roman Seidel  napísal(a):
>>
>>> Yes, sure, the input file is a snippet with a capital letter followed by
>>> 9 digits. The correct user pattern, corresponding to [1] is:
>>>
>>> ``\A\d\d\d\d\d\d\d\d\d``
>>>
>>> The result of Tesseract (psm 8) is fully correct. Nevertheless, user
>>> patterns is not working in the way described above.
>>>
>>> For instance, I have tried to extract only the capital character with
>>> user patterns (not with whitelist), which is:
>>>
>>> \A
>>>
>>> In this case, the capital letter and all digits are given back by
>>> tesseract.
>>>
>>> I've attached my input file and the corresponding Python snippet for
>>> reading and processing the image with tesserocr from [2]
>>>
>>>
>>> [1]
>>> https://github.com/tesseract-ocr/tesseract/blob/main/src/dict/trie.h#L197
>>> [2] https://github.com/sirfz/tesserocr
>>>
>>>
>>>
>>> Am Fr., 1. März 2024 um 18:59 Uhr schrieb René JM Clais <
>>> reneclai...@gmail.com>:
>>>
>>>> Can you send an example of an input document and the output of
>>>> tesseract as well of what should be your expectation using the pattern
>>>> file.
>>>>
>>>> Le jeu. 29 févr. 2024 à 21:40, Roman Seidel 
>>>> a écrit :
>>>>
>>>>> Hi all,
>>>>>
>>>>> I am currently trying to use user-patterns with the PyTessBaseAPI from
>>>>> tesserocr [1].
>>>>>
>>>>> What I've done is to initialize the API with:
>>>>>
>>>>> with PyTessBaseAPI(path='/usr/share/tesseract-ocr/4.00/tessdata', lang
>>>>> =LANGUAGE, psm=int(psm), oem=int(TOEM)) as api:
>>>>>
>>>>> setting the user patterns file with:
>>>>>
>>>>> api.SetVariable('user_patterns_file',
>>>>> '/home/roman/Dev_d/playground/user_patterns/deu.patterns')
>>>>>
>>>>> Where the user patterns file contains a pattern, e.g.:
>>>>>
>>>>> \A\A\A
>>>>>
>>>>> (which means three characters in capital letters).
>>>>>

Re: [tesseract-ocr] Re: Post OCR Verification and Editing

2024-03-08 Thread Zdenko Podobny

Hello,


I am not sure if OCRmyPDF (https://ocrmypdf.readthedocs.io/en/latest/)
allows redaction.

If you would like to implement the text layer yourself with a custom font,
have a look at PyMuPDF:

   - https://github.com/pymupdf/PyMuPDF/discussions/775 (Adding text layer
   to a scanned PDF)
   - https://github.com/pymupdf/PyMuPDF/discussions/2464 (invisible text
   layer)
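
For the hOCR route mentioned in the quoted message, the first step is usually to pull the words and their boxes out of the hOCR file before placing them in a text layer. A minimal sketch using only the Python standard library follows; it assumes the usual hOCR markup where each word is a span with class ocrx_word and a title that starts with "bbox x0 y0 x1 y1". The names HocrWords and parse_hocr_words are made up for illustration.

```python
import re
from html.parser import HTMLParser

_BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWords(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            match = _BBOX.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def parse_hocr_words(hocr):
    """Return the list of (word, bbox) pairs found in an hOCR string."""
    parser = HocrWords()
    parser.feed(hocr)
    parser.close()
    return parser.words
```

The resulting boxes are in image pixels, so they still need scaling to page coordinates before a tool such as PyMuPDF can place the text.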


Zdenko


št 7. 3. 2024 o 20:53 Mark Pellegrino  napísal(a):

> I found more info here:
>
> https://github.com/tesseract-ocr/tesseract/issues/1769#issuecomment-509490277
>
> Glyphless appears to be an 'invisible font' and all that Tesseract
> supports. It seems like the solution is to use Tesseract to generate hOCR,
> then use another tool to combine the source image with the hOCR?
>
> Does anyone have a simple workflow for editing/correcting Tesseract OCR
> documents that they can share?
>
> Thanks again,
>
> On Thursday 7 March 2024 at 14:17:28 UTC-5 Mark Pellegrino wrote:
>
>> Hello,
>> I'm trying to check PDFs made with Tesseract 5.2 for correctness using an
>> OCR editor but am unable to open them in either Abbyy or Acrobat.
>>
>> If I try to open a Tesseract PDF with Abbyy FineReader/OCR Editor, the
>> software just hangs and crashes. I can open Tesseract PDFs with Acrobat
>> Pro, but when I enable the  'Make OCR text visible' option in Preflight,
>> all of the text layer turns into unreadable black boxes. The font used
>> shows as 'GlyphLessFont' and appears to be embedded in the file.
>>
>> It doesn't matter what training data I use, or what the source image was,
>> I always get these results. Any other non-Tesseract made PDF works just
>> fine. I'm guessing that the issue is a missing font? I don't have much of
>> an understanding about how embedded PDF fonts work and I haven't found
>> anything about this in the Tesseract docs. Can someone please point me in
>> the right direction? Thanks.
>>
>>


Re: [tesseract-ocr] user patterns with tesserocr python API

2024-03-02 Thread Zdenko Podobny

Can you please elaborate on:

Nevertheless, user patterns is not working in the way described above.



Zdenko


so 2. 3. 2024 o 10:45 Roman Seidel  napísal(a):

> Yes, sure, the input file is a snippet with a capital letter followed by 9
> digits. The correct user pattern, corresponding to [1] is:
>
> ``\A\d\d\d\d\d\d\d\d\d``
>
> The result of Tesseract (psm 8) is fully correct. Nevertheless, user
> patterns is not working in the way described above.
>
> For instance, I have tried to extract only the capital character with user
> patterns (not with whitelist), which is:
>
> \A
>
> In this case, the capital letter and all digits are given back by
> tesseract.
>
> I've attached my input file and the corresponding Python snippet for
> reading and processing the image with tesserocr from [2]
>
>
> [1]
> https://github.com/tesseract-ocr/tesseract/blob/main/src/dict/trie.h#L197
> [2] https://github.com/sirfz/tesserocr
>
>
>
> Am Fr., 1. März 2024 um 18:59 Uhr schrieb René JM Clais <
> reneclai...@gmail.com>:
>
>> Can you send an example of an input document and the output of tesseract
>> as well of what should be your expectation using the pattern file.
>>
>> Le jeu. 29 févr. 2024 à 21:40, Roman Seidel  a
>> écrit :
>>
>>> Hi all,
>>>
>>> I am currently trying to use user-patterns with the PyTessBaseAPI from
>>> tesserocr [1].
>>>
>>> What I've done is to initialize the API with:
>>>
>>> with PyTessBaseAPI(path='/usr/share/tesseract-ocr/4.00/tessdata', lang=
>>> LANGUAGE, psm=int(psm), oem=int(TOEM)) as api:
>>>
>>> setting the user patterns file with:
>>>
>>> api.SetVariable('user_patterns_file',
>>> '/home/roman/Dev_d/playground/user_patterns/deu.patterns')
>>>
>>> Where the user patterns file contains a pattern, e.g.:
>>>
>>> \A\A\A
>>>
>>> (which means three characters in capital letters).
>>>
>>>
>>> The result is the same regardless of whether I use the user_patterns_file
>>> argument or not. This brings me to the question of whether tesserocr
>>> supports user (and word) patterns.
>>>
>>> My versions:
>>>
>>> tesserocr 2.6.2
>>> tesseract 5.3.3
>>>  leptonica-1.83.1
>>>   libpng 1.6.34 : zlib 1.2.11
>>>
>>> Thanks a lot for your help and best wishes,
>>> Roman
>>>
>>>
>>>
>>>
>>>
>>>
>>>


Re: [tesseract-ocr] How to correctly define CMakeLists.txt for Tesseract?

2024-02-20 Thread Zdenko Podobny

Is there any reason to use an external third-party app that is not available
on all platforms instead of a native CMake function, which is available
everywhere CMake is?

Zdenko


ut 20. 2. 2024 o 18:02 Tom Morris  napísal(a):

> On Monday, February 19, 2024 at 4:49:07 AM UTC-5 raphael.s...@gmail.com
> wrote:
>
> I solved the issue, thanks to the help and suggestions, and explanations,
> I kindly received in StackOverFlow
>
>
> In case someone has a similar issue in the future, the suggestion on
> StackOverflow was to use CMake's INTERFACE_LINK_LIBRARIES property.
>
> ...
> pkg_check_modules(tesseract REQUIRED IMPORTED_TARGET tesseract)
> # pkg-config doesn't know about dependencies of static libraries, so add
> these dependencies manually.
> set_property(TARGET PkgConfig::tesseract APPEND PROPERTY
> INTERFACE_LINK_LIBRARIES curl)
>
> target_link_libraries(BasicExample PUBLIC
> PkgConfig::tesseract # With that linkage CMake will automatically add
> linkage with curl.
> )
>
> There's more explanation in the StackOverflow answer.
>
> Tom
>


Re: [tesseract-ocr] How to correctly define CMakeLists.txt for Tesseract?

2024-02-17 Thread Zdenko Podobny

First of all: you should use tools you are familiar with. Your CMake
configuration (CMakeLists.txt) does not look that way (you would use CMake
to check for required libraries, not PkgConfig; you would not hardcode curl
for linking, etc.).

Next, you should provide all the details needed to replicate the problem.
What is missing (at least):

   1. BasicExample.cpp
   2. What OS you use (you put "WIN32 MACOSX_BUNDLE" in add_executable,
   which is quite a surprising combination)
   3. How did you install Tesseract? Some error outputs indicate a manually
   built, statically linked tesseract. Why a static build? It can be quite
   tricky to link a static library with external dependencies

Best regards,

Zdenko


so 17. 2. 2024 o 19:06 Raphael Stonehorse 
napísal(a):

> As described and discussed here:
> https://stackoverflow.com/questions/78011753/how-to-correctly-define-cmakelists-txt-for-tesseract
> I've been trying to use CMake for Tesseract compilation and building
>
> With this CMakeLists.txt :
>
> cmake_minimum_required(VERSION 3.5)
> project(BasicExample)
>
> set(CMAKE_CXX_STANDARD 17)
>
> find_package(PkgConfig REQUIRED)
>
> pkg_check_modules(tesseract REQUIRED IMPORTED_TARGET tesseract)
> pkg_check_modules(leptonica REQUIRED IMPORTED_TARGET lept)
> pkg_check_moduleS(libcurl REQUIRED IMPORTED_TARGET libcurl)
>
> add_executable(${PROJECT_NAME} WIN32 MACOSX_BUNDLE BasicExample.cpp)
>
> target_link_libraries(BasicExample PUBLIC
> PkgConfig::leptonica
> PkgConfig::tesseract
> -lcurl
> )
>
> I get these errors:
>
> raphy@raohy:~/tesseract/Examples$ cmake -B builddir
> -- The C compiler identification is GNU 12.3.0
> -- The CXX compiler identification is GNU 13.2.0
> -- Detecting C compiler ABI info
> -- Detecting C compiler ABI info - done
> -- Check for working C compiler: /usr/bin/cc - skipped
> -- Detecting C compile features
> -- Detecting C compile features - done
> -- Detecting CXX compiler ABI info
> -- Detecting CXX compiler ABI info - done
> -- Check for working CXX compiler: /usr/bin/c++ - skipped
> -- Detecting CXX compile features
> -- Detecting CXX compile features - done
> -- Found PkgConfig: /usr/bin/pkg-config (found version "1.8.1")
> -- Checking for module 'tesseract'
> --   Found tesseract, version 5.3.4
> -- Checking for module 'lept'
> --   Found lept, version 1.82.0
> -- Checking for module 'libcurl'
> --   Found libcurl, version 8.2.1
> -- Configuring done (0.3s)
> -- Generating done (0.0s)
> -- Build files have been written to:
> /home/raphy/tesseract/Examples/builddir
> raphy@raohy:~/tesseract/Examples$
> raphy@raohy:~/tesseract/Examples$ cmake --build builddir/
> [ 50%] Building CXX object
> CMakeFiles/BasicExample.dir/BasicExample.cpp.o
> [100%] Linking CXX executable BasicExample
> /usr/bin/ld: /usr/local/lib/libtesseract.a(baseapi.cpp.o): in function
> `tesseract::TessBaseAPI::ProcessPagesInternal(char const*, char const*,
> int, tesseract::TessResultRenderer*)::{lambda(char
> const*)#1}::operator()(char const*) const':
> baseapi.cpp:(.text+0x13): undefined reference to `curl_easy_strerror'
> /usr/bin/ld: baseapi.cpp:(.text+0x3b): undefined reference to
> `curl_easy_cleanup'
> /usr/bin/ld: /usr/local/lib/libtesseract.a(baseapi.cpp.o): in function
> `tesseract::TessBaseAPI::ProcessPagesInternal(char const*, char const*,
> int, tesseract::TessResultRenderer*)':
> baseapi.cpp:(.text+0xad07): undefined reference to `curl_easy_init'
> /usr/bin/ld: baseapi.cpp:(.text+0xad48): undefined reference to
> `curl_easy_setopt'
> /usr/bin/ld: baseapi.cpp:(.text+0xad5d): undefined reference to
> `curl_easy_strerror'
> /usr/bin/ld: baseapi.cpp:(.text+0xad89): undefined reference to
> `curl_easy_cleanup'
> /usr/bin/ld: baseapi.cpp:(.text+0xb26f): undefined reference to
> `curl_easy_setopt'
> /usr/bin/ld: baseapi.cpp:(.text+0xb298): undefined reference to
> `curl_easy_setopt'
> /usr/bin/ld: baseapi.cpp:(.text+0xb2c1): undefined reference to
> `curl_easy_setopt'
> /usr/bin/ld: baseapi.cpp:(.text+0xb2fa): undefined reference to
> `curl_easy_setopt'
> /usr/bin/ld: baseapi.cpp:(.text+0xb324): undefined reference to
> `curl_easy_setopt'
> /usr/bin/ld:
> /usr/local/lib/libtesseract.a(baseapi.cpp.o):baseapi.cpp:(.text+0xb3c9):
> more undefined references to `curl_easy_setopt' follow
> /usr/bin/ld: /usr/local/lib/libtesseract.a(baseapi.cpp.o): in function
> `tesseract::TessBaseAPI::ProcessPagesInternal(char const*, char const*,
> int, tesseract::TessResultRenderer*)':
> baseapi.cpp:(.text+0xb455): undefined reference to `curl_easy_perform'
> /usr/bin/ld: baseapi.cpp:(.text+0xb6b0): undefined reference to
> `curl_easy_cleanup'
> /usr/bin/ld: /usr/local/lib/libtesseract.a(tessdatamanager.cpp.o): in
> function `tesseract::TessdataManager::LoadArchiveFile(char const*)':
> tessdatamanager.cp

Re: [tesseract-ocr] Re: image_to_string OSD hell

2024-02-13 Thread Zdenko Podobny

Works like a charm: just read and follow the documentation carefully:

>tesseract e_I_read_documetation_carefully.png - --psm 10
D
>tesseract d_I_read_documetation_carefully.png - --psm 10
E
>tesseract d-I_read_documetation_carefully.png - --psm 10
D-
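
When the same options are passed through pytesseract, each variable needs its own -c and the mode options need two dashes. A small sketch of building such a config string follows; build_config is a made-up helper, and it assumes the wrapper splits the config string shell-style (as shlex.split does) before handing it to tesseract.

```python
import shlex


def build_config(psm, oem=None, variables=None):
    """Build a tesseract option string with one -c per variable."""
    parts = []
    for name, value in (variables or {}).items():
        parts += ["-c", f"{name}={value}"]
    parts += ["--psm", str(psm)]
    if oem is not None:
        parts += ["--oem", str(oem)]
    # Quote each argument so values containing spaces survive shell-style splitting.
    return " ".join(shlex.quote(p) for p in parts)
```

For example, build_config(10, 3, {"tessedit_char_whitelist": "ABCDEF+-"}) gives a string that can be passed as config= to image_to_string.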


Zdenko


st 14. 2. 2024 o 2:14 dev 313153  napísal(a):

> Hello,
> I managed to implement dynamic parsing to get rid of the OSD issues I had.
> However, I'm stuck on recognizing a single uppercase letter. I tried many
> different preprocessing configurations but I can't find the right one,
> even with PSM set to 10, and I don't really know what else to try.
> Any help is appreciated.
>
> Here is a code snippet for testing, with pictures attached:
> import cv2
> import os
> import pytesseract
> import numpy as np
>
> pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
>
> for pic in ["e.png", "d-.png", "d.png"]:
>     img = cv2.imread(pic)
>
>     # Preprocessing
>     img = cv2.resize(img, (70, 90), interpolation=cv2.INTER_NEAREST)
>     norm_img = np.zeros((img.shape[0], img.shape[1]))
>     img = cv2.normalize(img, norm_img, 0, 255, cv2.NORM_MINMAX)
>     img = cv2.fastNlMeansDenoisingColored(img, None, 10, 10, 7, 15)
>     img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
>     img = cv2.bitwise_not(img)
>     img = cv2.threshold(img, 127, 255, cv2.THRESH_BINARY)[1]
>     cv2.imwrite("processed-" + pic, img)
>
>     # Tesseract OCR: each variable needs its own -c, and oem takes two dashes
>     text = pytesseract.image_to_string(
>         img, lang='eng',
>         config='-c tessedit_char_whitelist=\\ ABCDEF+- '
>                '-c tessedit_char_blacklist=\\=!,*%^$°:. --psm 10 --oem 3')
>     print(str(text).replace("\n", " "))
>
>
> On Wednesday, 7 February 2024 at 06:39:37 UTC+1, dev 313153 wrote:
>
>> Hello,
>> I am very new to tesseract, as well as to image processing in general.
>> I have screenshots from which I want to extract text for further
>> processing. I played around with tesseract after checking the Improve
>> Quality URL and was able to extract what I need (most of the time).
>> For example, in the attached screenshots, I want to extract the names of
>> the stats together with the letter that follows, but it doesn't always work.
>> Sometimes the letter isn't extracted, and sometimes it is, but the OSD
>> considers that it belongs to another level or row, and it is output ahead
>> of the stat names when I use image_to_string.
>> I also tried playing with the oem and psm settings, without much improvement.
>>
>> I attached some examples of image_to_string output for different pictures,
>> as well as the images and the Python code I'm using as a test bench.
>>
>> I am getting a bit desperate, so I am considering the following approaches:
>> - training my own dataset for this need; having sufficient data shouldn't
>> be an issue over time, but I have zero experience with this kind of thing.
>> - looking for the coordinates of the stat names, and then cropping the picture
>> around them to make sure tesseract focuses on them and extracts them properly
>> (sounds like a chore code-wise, but doable, I think).
>>
>> Let me know what you think about it or if you have improvements to
>> suggest.
>> Best Regards,
>>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAJbzG8xZxZqhPy73VM-__W%3DaKbwjZMuuNxuT8OOJZ4jjysr%2BXw%40mail.gmail.com.

Re: [tesseract-ocr] Trouble with Apparently Simple Source Image

2024-02-12 Thread Zdenko Podobny

tesseract I_read_docs_carefully_instead_of_a_lot_of_writing.png - --psm 6
$0.081

Zdenko


On Mon, 12 Feb 2024 at 18:40, Rob wrote:

> Hello,
>
> I've run into some trouble using Tesseract OCR in a python program doing
> some screen scraping. I can't quite wrap my head around why this one value
> is having so much more trouble than the others on the same page,  with the
> same contrast and font.
>
> This is the image in question:
> It has been scraped from a 1080p resolution screenshot, sliced into
> individual images for the values in a grid, scaled up by 10x, inverted
> (from white-on-black to this), thresholded, and passed to Tesseract. I have
> also tried various Gaussian and median blurs but those seem to just make
> other strings fail more.
>
> I have tried most of the PSM options that make sense, and passed options
> with just numerals, $, comma, and decimal as allow list of characters. I've
> tried all the different interpolations OpenCV has to offer. Tesseract just
> constantly chokes on this value.
>
> It's a little frustrating because the only OCR I've found that works with
> this value is an A9T9 model(I think) through the free api at ocr.space (
> https://ocr.space/ocrapi#ocrengine2 ). Unfortunately there doesn't appear
> to be a way for me to run that locally, and the string seems like it should
> be simple for an OCR read.
>
> Any advice on poking Tesseract in the right way to read this, or some
> fancy filtering I could do to help make the image clearer for it?
>
> Thanks!
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAJbzG8xw5JQ7J6atb4WOQN-q%2BrEGMeQbUv9OvfMG%3DrMQr0fgig%40mail.gmail.com.

Re: [tesseract-ocr] Make russian_with_accent traineddata file

2024-02-06 Thread Zdenko Podobny

You are referring to an old issue...
Either you provide steps to replicate your problem (including an input image),
or you have to solve it yourself.

Zdenko


On Mon, 5 Feb 2024 at 9:53, Romain B. (Le Belge) wrote:

> Hi,
>
> I saw that tesseract makes the mistake of turning Russian vowels with
> accents (ò, à, ...), used mostly for educational purposes, into other
> Russian letters, and saw that someone with the same problem had created
> trained data (if I understood correctly) for Russian with accents.
>
> The problem is, I cannot find a way to make a traineddata file from it, to
> test it and later use it in my code. I found the tesstrain git
> repository, but was not able to make it work with the data I found.
>
> I honestly don't know if I am missing something, misunderstanding
> something, or if we simply don't train with these types of files
> anymore.
>
> If you have any clue, that would help me a lot.
>
> Thank you!
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAJbzG8yBdaLN04F4F5pLgFHfA2WF2Rvx%2B-Qw%2BRvTsHoVE%3DZP5A%40mail.gmail.com.

Re: [tesseract-ocr] Re: I need help to develop image to text extraction

2024-02-06 Thread Zdenko Podobny

Did you read the tesseract documentation?
Do you understand it?


Zdenko


On Tue, 6 Feb 2024 at 12:38, Santhiya C wrote:

> How do I fix this issue by training Tesseract OCR on custom data?
>
> On Tuesday 6 February 2024 at 12:11:03 UTC+5:30 Santhiya C wrote:
>
>> Can you please tell me the model and steps?
>>
>> On Monday 5 February 2024 at 17:22:10 UTC+5:30 aromal...@gmail.com wrote:
>>
>>> If you are getting started with OCR, try some other engines or just
>>> start with some deep learning models
>>> and understand the basic workings.
>>> On Thursday 1 February 2024 at 11:17:14 UTC+5:30 santhi...@gmail.com
>>> wrote:
>>>
>>>> I already used the above-mentioned steps, but I lost the data.
>>>>
>>>> On Saturday 27 January 2024 at 06:52:54 UTC+5:30 g...@hobbelt.com
>>>> wrote:
>>>>
> L.S.,
>
> *PDF. OCR. text extraction. best language models? not a lot of success
> yet...*
>
> 🤔
>
> Broad subject.  Learning curve ahead. 🚧 Workflow diagram included
> today.
>
>
> *Tesseract does not live alone*
>
> Tesseract is an engine, which takes an image as input and produces
> text output; several output formats are available. If you are unsure,
> start with HOCR output as that's close to modern HTML and carries almost
> all info tesseract produces during the OCR process.
> If it isn't an image you've got, you need a preprocess (and
> consequently additional tools) to produce images you can feed tesseract.
> tesseract is designed to process a SINGLE IMAGE. (Yes, that means you may
> want to 'merge' its output: postprocessing)
>
> * To complicate matters immediately, tesseract can deal with
> "multipage TIFF" images and can accept multiple images to process via its
> commandline. Keep thinking "one page image in, bunch of text out" and
> you'll be okay until you discover the additional possibilities.*
>
> *Advice Number 1: *get a tesseract executable, invoke it using its
> commandline interface. If you can't build tesseract yourself, Uni Mannheim
> may have binaries for you to download and install. Linuxes often have
> tesseract binaries and mandatory language models available as packages,
> BUT many Linuxes are more or less far behind the curve: latest tesseract
> release as of this writing is 5.3.4:
> https://github.com/tesseract-ocr/tesseract/releases so VERIFY your
> rig has the latest tesseract installed. Older releases are older and
> "previous" for a reason!
>
>
> *Preprocessing is the chorus of this song*
>
> As you say "PDF", you therefore need to convert that thing to *page
> images*. My personal favorite is the Artifex mupdf toolkit, using
> mutool or mudraw / etc. tools from that commandline toolkit to render
> accurate, high-rez page images. Others will favor other means but it all
> ends up doing the same thing: anything, PDFs et al, is to be converted to
> one image per page and fed to tesseract that way. The rendered page images
> MAY require additional *image preprocessing*:
>
>
> *This next bit cannot be stressed enough: *tesseract is designed and
> engineered to work on plain printed book pages, i.e. BLACK TEXT on PLAIN
> WHITE BACKGROUND. As I observe everyone and their granny dumping holiday
> snapshots, favorite CD, LP and fancy colourful book covers straight into
> tesseract and complaining "nothing sensible is coming out": that's because
> you're feeding it a load of dung as far as the engine is concerned: it
> expects BLACK TEXT on PLAIN WHITE BACKGROUND like a regular dull printed
> page in a BOOK, so anything with nature backgrounds, colourful architectural
> backgrounds and such is begging for a disaster. And I only empathize with
> the grannies. This is why
> https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html is
> mentioned almost every week in this mailing list, for example. It's very
> important, but you'll need more...
>
>
> The take-away? You'll need additional tools for image preprocessing
> until you can produce greyscale or B&W images that look almost as if these
> were plain old boring book pages: no or very little fancy stuff, black
> text (anti-aliased or not), white background.
> Bonus points for you when your preprocess removes non-text image
> components, e.g. photographs, in the page image: it can only confuse the
> OCR engine so when you strive for perfection, that's one more bit to deal
> with BEFORE you feed it into tesseract and wait expectantly... (Besides,
> tesseract will have less discovery to do so it'll be faster too. Of little
> importance, relatively speaking, but there you have it.)
> As also mentioned at
> https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html : tools
> of interest re image processing are leptonica (par

Re: [tesseract-ocr] OCR of free hand photo of book

2024-01-31 Thread Zdenko Podobny

Tesseract is an OCR engine, and the user is responsible for preprocessing;
see the documentation.
IMO there is already an app (using tesseract) for what you are trying to do:
Text Fairy [1]

[1] https://play.google.com/store/apps/details?id=com.renard.ocr&hl=en

Zdenko


On Wed, 31 Jan 2024 at 2:00, Borneq wrote:

> First I tested tesseract on a file generated as a flat image.
> I generated Lorem Ipsum text:
>
> 5 paragraphs, 452 words, 2978 bytes, 24 lines + 4 blank lines; the maximum
> line length in my editor was 135 chars.
>
> Result: 100% accurate except for two full stops. Fantastic.
>
> Next, I rotated the image. A rotation of only 0.7 degrees caused a lot of
> confusion, and a minor rotation of 0.1-0.6 degrees made it read some m as n.
>
> In my book photos, the images are often rotated by up to 3.5 degrees.
> Worse, the text is bent into curved lines that look like an F-distribution
>
> ("What function looks like the edge of a paper book sideways?" on
> math.stackexchange.com)
>
> How can I work with real photos of books? Is it possible through an option,
> or is this something that is missing in tesseract?
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAJbzG8wdJtDmAiBLstMRU2CVe_ZL2RiMeZH5wk%3DXFW-crK16yw%40mail.gmail.com.

Re: [tesseract-ocr] tesseract is reading passport mrz text from image incorrectly, its identifying <<<<<<<< as kkkk or cccc

2024-01-27 Thread Zdenko Podobny

Well, in this case it works without image processing ;-)

Anyway, mrz is not an "official" Tesseract training, and there are people who
play with it, so it will take some time to search and dig through
their findings/experience/expertise.
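
A small sanity check can catch bad OCR output before it reaches an MRZ parser. The sketch below is illustrative only: `check_mrz_lines` is a hypothetical helper, and the default row length of 30 assumes the three-line TD1 layout named in the parser errors quoted in this thread. It reports the same two kinds of problems seen there, characters outside `A-Z`, `0-9` and `<`, and rows of unexpected length.

```python
import string

# Characters permitted in a machine readable zone: A-Z, 0-9 and the filler "<".
ALLOWED = set(string.ascii_uppercase + string.digits + "<")


def check_mrz_lines(text, expected_len=30):
    """Return a list of human-readable problems found in OCR output.

    expected_len=30 is an assumption matching the TD1 (3 x 30) layout.
    """
    # Drop blank lines and the spaces an OCR engine may insert inside a row.
    lines = [ln.replace(" ", "") for ln in text.splitlines() if ln.strip()]
    problems = []
    for row, line in enumerate(lines):
        if len(line) != expected_len:
            problems.append(
                f"row {row}: length {len(line)}, expected {expected_len}"
            )
        for col, ch in enumerate(line):
            if ch not in ALLOWED:
                problems.append(f"row {row}, col {col}: invalid character {ch!r}")
    return problems


if __name__ == "__main__":
    for problem in check_mrz_lines("MUSTERFRAUSKISOLDEKKK"):
        print(problem)
```

An empty result only means the text has the right shape; it does not verify check digits.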

Zdenko


On Sat, 27 Jan 2024 at 12:02, sara waheed wrote:

> If I hadn't researched, how would I know Tesseract needs image processing? I
> am new to OCR and in the learning phase. Please be kind and help, thanks :)
>
> On Saturday, January 27, 2024 at 3:26:40 PM UTC+5 zdenop wrote:
>
>> What about reading docs and a little bit googling?
>>
>> tesseract two-page-passport-mrz-detected.jpeg - --psm 6 -l mrz
>>
>> IDAUT1999<6<<<
>> 7109094F1112315AUT<<<6
>> MUSTERFRAU<>
>>
>> Zdenko
>>
>>
>> On Sat, 27 Jan 2024 at 11:19, sara waheed wrote:
>>
>>> I am trying to read the passport MRZ string from an image. I am using
>>> Tesseract, and OpenCV for image processing. I have tried three different
>>> ways; none of them worked.
>>>
>>> **Attempt 1**
>>> I have this image. When I run OCR on it, Tesseract reads it as
>>>
>>> IDAUT1999<6<<<
>>> 7109094F1112315AUT<
>>> MUSTERFRAU<
>>>
>>> which is incorrect: it treats <<< as x or c or k. When I use the
>>> `mrz-java` library to read the details from the string, it gives the
>>> following error:
>>>
>>> [error] Error parsing MRZ string: Failed to parse MRZ MRTD_TD1
>>> IDAUT1999<6<<<
>>> [error] 7109094F1112315AUT<
>>> [error] MUSTERFRAU<
>>> [error]  at 24-25,1: Invalid character in MRZ record: x
>>>
>>> **Attempt 2**
>>>
>>> Then I converted the image to grayscale and binarized it using `OpenCV`.
>>> Here is the code:
>>>
>>> val roiImagePath =
>>> "src/main/resources/ocr/passport/two-page-passport-mrz-detected.jpeg"
>>>
>>> val grayScaleROI = new Mat()
>>>   val roiImage = Imgcodecs.imread(roiImagePath)
>>>   Imgproc.cvtColor(roiImage, grayScaleROI,
>>> Imgproc.COLOR_BGR2GRAY)
>>>   val roiGaryImagePath =
>>> "src/main/resources/ocr/passport/two-page-passport-mrz-detected-gray.jpeg"
>>>
>>>   Imgcodecs.imwrite(roiGaryImagePath, grayScaleROI)
>>>   val binary = new Mat()
>>>   Imgproc.adaptiveThreshold(grayScaleROI, binary, 255,
>>> Imgproc.ADAPTIVE_THRESH_MEAN_C, Imgproc.THRESH_BINARY , 15, 25)
>>>   val roiBinaryImagePath =
>>> "src/main/resources/ocr/passport/two-page-passport-mrz-detected-binary.jpeg"
>>>   Imgcodecs.imwrite(roiBinaryImagePath, binary)
>>>
>>>  val tesseract = new Tesseract()
>>>   tesseract.setDatapath("/usr/share/tesseract-ocr/4.00/tessdata")
>>>   tesseract.setVariable("user_defined_dpi", "600")
>>>   val result = tesseract.doOCR(new File(roiBinaryImagePath))
>>>   val mrzStr = result.replace(" ", "")
>>>   println(s"two page passport mrz string is: "+mrzStr)
>>>
>>> It created the following binary image
>>>
>>> and the code output is as follows. Tesseract reads the MRZ string from
>>> the binary image as
>>>
>>> IDAUT1DODD999
>>> 7AD9D9GF1TEZSISAUTKEKG
>>> MUSTERFRAUSKISOLDEKKK
>>>
>>> and `mrz-java` reads the string and generates the following error:
>>>
>>> [error] Error parsing MRZ string: Failed to parse MRZ null
>>> IDAUT1DODD999
>>> [error] 7AD9D9GF1TEZSISAUTKEKG
>>> [error] MUSTERFRAUSKISOLDEKKK
>>> [error]  at 0-0,0: Different row lengths: 0: 29 and 1: 30
>>>
>>> **Attempt 3**
>>>
>>> Then I resized the image:
>>>
>>>   // Increase width proportionately (adjust based on your needs)
>>>   val width = 1000
>>>   // Maintain aspect ratio
>>>   val height = (width * binary.rows()) / binary.cols()
>>>
>>>   val resizedRoiImage = new Mat()
>>>   Imgproc.resize(binary, resizedRoiImage, new Size(width, height),
>>>     0.0, 0.0, Imgproc.INTER_NEAREST)
>>>
>>>   val resizedImageROIPath =
>>>     "src/main/resources/ocr/passport/two-page-passport-mrz-detected-binary-resized_image.jpg"
>>>   Imgcodecs.imwrite(resizedImageROIPath, resizedRoiImage)
>>>
>>> mrz string read by Tesseract
>>>
>>> TOAUTIIISKhcceddce
>>> FIOPOSAFIFESSISAUTReececeececs
>>> MUSTERFRAUCCKISOLDECKdcddd
>>>
>>> and the error is
>>>
>>> [info] 15:54:04.200 633 [main] MrzParser INFO - Check digit
>>> verification failed for document number: expected 0 but got h
>>> [error] Error parsing MRZ string: Failed to parse MRZ MRTD_TD1
>>> TOAUTIIISKhcceddce
>>> [error] FIOPOSAFIFESSISAUTReececeececs
>>> [error] MUSTERFRAUCCKISOLDECKdcddd
>>> [error]  at 15-16,0: Invalid character in MRZ record: c
>>>
>>>
>>> Can anyone please help me read the text properly? I have also tried
>>> a regex to convert c or k back to <<<, but it did not work either. If anyone
>>> can suggest a workaround or any improvement to the code, please help me with
>>> that. Thanks
>>>

Re: [tesseract-ocr] tesseract is reading passport mrz text from image incorrectly, its identifying <<<<<<<< as kkkk or cccc

2024-01-27 Thread Zdenko Podobny

What about reading docs and a little bit googling?

tesseract two-page-passport-mrz-detected.jpeg - --psm 6 -l mrz

IDAUT1999<6<<<
7109094F1112315AUT<<<6
MUSTERFRAU<

Zdenko


On Sat, 27 Jan 2024 at 11:19, sara waheed wrote:

> I am trying to read the passport MRZ string from an image. I am using
> Tesseract, and OpenCV for image processing. I have tried three different
> ways; none of them worked.
>
> **Attempt 1**
> I have this image. When I run OCR on it, Tesseract reads it as
>
> IDAUT1999<6<<<
> 7109094F1112315AUT<
> MUSTERFRAU<
>
> which is incorrect: it treats <<< as x or c or k. When I use the `mrz-java`
> library to read the details from the string, it gives the following error:
>
> [error] Error parsing MRZ string: Failed to parse MRZ MRTD_TD1
> IDAUT1999<6<<<
> [error] 7109094F1112315AUT<
> [error] MUSTERFRAU<
> [error]  at 24-25,1: Invalid character in MRZ record: x
>
> **Attempt 2**
>
> Then I converted the image to grayscale and binarized it using `OpenCV`.
> Here is the code:
>
> val roiImagePath =
> "src/main/resources/ocr/passport/two-page-passport-mrz-detected.jpeg"
>
> val grayScaleROI = new Mat()
>   val roiImage = Imgcodecs.imread(roiImagePath)
>   Imgproc.cvtColor(roiImage, grayScaleROI, Imgproc.COLOR_BGR2GRAY)
>   val roiGaryImagePath =
> "src/main/resources/ocr/passport/two-page-passport-mrz-detected-gray.jpeg"
>
>   Imgcodecs.imwrite(roiGaryImagePath, grayScaleROI)
>   val binary = new Mat()
>   Imgproc.adaptiveThreshold(grayScaleROI, binary, 255,
> Imgproc.ADAPTIVE_THRESH_MEAN_C, Imgproc.THRESH_BINARY , 15, 25)
>   val roiBinaryImagePath =
> "src/main/resources/ocr/passport/two-page-passport-mrz-detected-binary.jpeg"
>   Imgcodecs.imwrite(roiBinaryImagePath, binary)
>
>  val tesseract = new Tesseract()
>   tesseract.setDatapath("/usr/share/tesseract-ocr/4.00/tessdata")
>   tesseract.setVariable("user_defined_dpi", "600")
>   val result = tesseract.doOCR(new File(roiBinaryImagePath))
>   val mrzStr = result.replace(" ", "")
>   println(s"two page passport mrz string is: "+mrzStr)
>
> It created the following binary image
>
> and the code output is as follows. Tesseract reads the MRZ string from
> the binary image as
>
> IDAUT1DODD999
> 7AD9D9GF1TEZSISAUTKEKG
> MUSTERFRAUSKISOLDEKKK
>
> and `mrz-java` reads the string and generates the following error:
>
> [error] Error parsing MRZ string: Failed to parse MRZ null
> IDAUT1DODD999
> [error] 7AD9D9GF1TEZSISAUTKEKG
> [error] MUSTERFRAUSKISOLDEKKK
> [error]  at 0-0,0: Different row lengths: 0: 29 and 1: 30
>
> **Attempt 3**
>
> Then I resized the image:
>
>   // Increase width proportionately (adjust based on your needs)
>   val width = 1000
>   // Maintain aspect ratio
>   val height = (width * binary.rows()) / binary.cols()
>
>   val resizedRoiImage = new Mat()
>   Imgproc.resize(binary, resizedRoiImage, new Size(width, height),
>     0.0, 0.0, Imgproc.INTER_NEAREST)
>
>   val resizedImageROIPath =
>     "src/main/resources/ocr/passport/two-page-passport-mrz-detected-binary-resized_image.jpg"
>   Imgcodecs.imwrite(resizedImageROIPath, resizedRoiImage)
>
> mrz string read by Tesseract
>
> TOAUTIIISKhcceddce
> FIOPOSAFIFESSISAUTReececeececs
> MUSTERFRAUCCKISOLDECKdcddd
>
> and the error is
>
> [info] 15:54:04.200 633 [main] MrzParser INFO - Check digit
> verification failed for document number: expected 0 but got h
> [error] Error parsing MRZ string: Failed to parse MRZ MRTD_TD1
> TOAUTIIISKhcceddce
> [error] FIOPOSAFIFESSISAUTReececeececs
> [error] MUSTERFRAUCCKISOLDECKdcddd
> [error]  at 15-16,0: Invalid character in MRZ record: c
>
>
> Can anyone please help me read the text properly? I have also tried a
> regex to convert c or k back to <<<, but it did not work either. If anyone can
> suggest a workaround or any improvement to the code, please help me with that.
> Thanks
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAJbzG8xbT8jWSOveXeSRCHE_Vr%2Bx%3DoXo0k4yuqtL_MUH%2BN6rRA%40mail.gmail.com.

Re: [tesseract-ocr] Unable to build on macOS Mojave (10.14)

2024-01-23 Thread Zdenko Podobny

You can install tesseract without building and installing the training tools.

Anyway, requiring tesseract as a dependency of ffmpeg makes no sense to me
(and it is not listed at https://trac.ffmpeg.org/wiki/CompilationGuide/macOS).
So something in homebrew should be fixed or set up correctly.


Zdenko


On Tue, 23 Jan 2024 at 11:55, Benoît Mars wrote:

> Yes, that's why I am stuck during the build. The dependency is added
> by homebrew when I try to update ffmpeg, and I cannot find a way to
> avoid it…
>
> On Thursday, 18 January 2024 at 21:31:11 UTC+1, tfmo...@gmail.com wrote:
>
>> On Friday, January 12, 2024 at 8:44:28 AM UTC-5 zdenop wrote:
>>
>> I also wonder why the tesseract build process did not stop during
>> configuration process as there should be tests for C++17...
>>
>>
>> Mojave has the C++17 compiler support, so the compiler check succeeds,
>> but the stdlib implementation is missing some things required for full
>> C++17 support (like the filesystem library).
>>
>> Tom
>>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAJbzG8yPTEmXzmQE7Lc3KZRoRDDAqtqScmoR-dbf9mzarKkgVg%40mail.gmail.com.

Re: [tesseract-ocr] Miss lots of words in the detection

2024-01-22 Thread Zdenko Podobny

Hi,

The most critical part is this:
https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html, but I need to
stress: tesseract is an OCR *engine*, not an OCR *suite*.
Unless your input is a scan of a book page without a
difficult structure, you need to do your part, such as image processing and
document segmentation (detection of text blocks).

This is the reason why you get "unsatisfactory" results if you send
complicated images with non-uniform text, graphics, etc.
However, if you use only the text part of the image for recognition, you can
get very good results.
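
As a rough illustration of feeding tesseract one region at a time, the sketch below computes crop boxes for a fixed grid. This is a crude stand-in for real text block detection, not something tesseract or its documentation prescribes; `tile_boxes` and its parameters are hypothetical. The boxes use the (left, top, right, bottom) order that image libraries such as Pillow accept for cropping, and each crop would then be passed to tesseract separately.

```python
def tile_boxes(width, height, rows, cols, overlap=0):
    """Split a width x height image into rows x cols crop boxes.

    Each box is (left, top, right, bottom) in pixels. `overlap` widens every
    box by that many pixels so text on a tile border is not cut in half.
    """
    boxes = []
    for r in range(rows):
        for c in range(cols):
            left = max(0, c * width // cols - overlap)
            top = max(0, r * height // rows - overlap)
            right = min(width, (c + 1) * width // cols + overlap)
            bottom = min(height, (r + 1) * height // rows + overlap)
            boxes.append((left, top, right, bottom))
    return boxes


if __name__ == "__main__":
    # A 1920x1080 screenshot cut into a 2x2 grid with a 20-pixel overlap.
    for box in tile_boxes(1920, 1080, rows=2, cols=2, overlap=20):
        print(box)
```

A fixed grid can still split a text block; detecting the actual blocks first gives cleaner crops.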

Best regards,

Zdenko


On Mon, 22 Jan 2024 at 19:42, L ht wrote:

> Hi Zdenko,
>
> Thanks for your response.
> I read the Tesseract User Manual (https://tesseract-ocr.github.io/tessdoc/),
> but have not read the code.
>
> I tried both tessdata_best and tessdata, and tried different values of
> --psm, but still cannot get more detections.
>
> To provide some context, when I applied Tesseract to the entire image, it
> managed to identify only a few words, such as "Log in," "Username,"
> "Password," and "Cancel," primarily within the central, well-lit portion.
> However, when I cropped the image to retain either the upper or left
> portions, Tesseract exhibited improved performance, successfully detecting
> numerous words in those respective areas.
>
> Best,
> Haitao
>
> On Sun, Jan 21, 2024 at 3:02 AM Zdenko Podobny  wrote:
>
>> Did you read the documentation or did you just set your expectations?
>>
>>
>> Zdenko
>>
>>
>> On Sun, 21 Jan 2024 at 12:00, L ht wrote:
>>
>>> I am new to tesseract. I found that tesseract does not work as expected.
>>> I attach one example.
>>>
>>> tesseract 5.3.2
>>> tesseract 272525030292764523137280353496213864766.png - -l eng --psm 3 quiet
>>> can only detect these words:
>>> "Log in
>>> Username
>>> Password
>>> Cancel"
>>>
>>> I submitted this picture to several online picture-to-text converters. They
>>> work well, detecting most of the text in the picture.
>>> For example, https://www.imagetotext.info/ claims that it uses
>>> tesseract.
>>>
>>> I am not sure if I am using tesseract correctly.
>>> Can anyone else help test what detection result you get for this
>>> picture?
>>> Thanks
>>>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAJbzG8zc4pyY%2BGJfVGrJ-yDMTo1tLn9DA502FJeB_V%3DLKi5p%2BQ%40mail.gmail.com.

Re: [tesseract-ocr] Miss lots of words in the detection

2024-01-21 Thread Zdenko Podobny

Did you read the documentation or did you just set your expectations?


Zdenko


On Sun, 21 Jan 2024 at 12:00, L ht wrote:

> I am new to tesseract. I found that tesseract does not work as expected. I
> attach one example.
>
> tesseract 5.3.2
> tesseract 272525030292764523137280353496213864766.png - -l eng --psm 3 quiet
> can only detect these words:
> "Log in
> Username
> Password
> Cancel"
>
> I submitted this picture to several online picture-to-text converters. They
> work well, detecting most of the text in the picture.
> For example, https://www.imagetotext.info/ claims that it uses
> tesseract.
>
> I am not sure if I am using tesseract correctly.
> Can anyone else help test what detection result you get for this picture?
> Thanks
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAJbzG8y9abBL2T7wEiWB9KDAuOqkVY4DZcuqpc7u9PbY3jxfEg%40mail.gmail.com.

Re: [tesseract-ocr] Re: Unable to get Orientation with node-tesseract, Warning, detects only orientation with -l eng Error, OSD requires a model for the legacy engine

2024-01-13 Thread Zdenko Podobny

You do not need to rename the traineddata files. You can move them into
tessdata subdirectories, e.g. tessdata/fast and tessdata/best, and then select
them with "-l best/eng" or "-l fast/eng".
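
The layout described above can be sketched in a few lines of Python. `install_model` is a hypothetical helper, and the file written in the demo is a placeholder rather than a real model; the only behaviour relied on is the one stated in this message, that `-l best/eng` refers to `best/eng.traineddata` under the tessdata directory.

```python
import shutil
import tempfile
from pathlib import Path


def install_model(tessdata_dir, variant, traineddata_path):
    """Copy a .traineddata file into tessdata/<variant>/ and return the -l value."""
    src = Path(traineddata_path)
    target_dir = Path(tessdata_dir) / variant
    target_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, target_dir / src.name)
    # "eng.traineddata" stored under "best" is selected with "-l best/eng".
    return f"{variant}/{src.stem}"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Placeholder bytes stand in for a downloaded model file.
        model = Path(tmp) / "eng.traineddata"
        model.write_bytes(b"placeholder, not a real model")
        lang = install_model(Path(tmp) / "tessdata", "best", model)
        print("use with: -l", lang)
```

Keeping the original file name means the fast and best variants can sit side by side without renaming.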

Zdenko


On Sat, 13 Jan 2024 at 3:38, Oliver Saintilien wrote:

> Oh right, for those facing a similar issue, what I did was:
> 1. Replace the eng.traineddata file with the eng.traineddata found in
> tesseract-ocr/tessdata ("Trained models with fast variant of the "best"
> LSTM models + legacy models", github.com). I
> didn't delete the original file but renamed it.
> 2. Test the orientation command directly with tesseract in the terminal,
> like so: tesseract
> "C:\Users\osain\OneDrive\Desktop\2000\Document_20240110_0001.jpg" stdout
> --psm 0 --oem 0
>
> If this command works in the terminal, then it will work in the node
> wrapper version. Here is how I called it:
> tesseract.recognize(path, {
>   oem: 0,
>   psm: 0,
>   lang: "eng"
> })
>   .then((data) => {
>     return data
>   })
>   .catch((error) => {
>     console.log(error.message)
>   })
>
>
> On Friday, January 12, 2024 at 8:21:03 PM UTC-5 Oliver Saintilien wrote:
>
>> Great it works like a charm now, thanks very much for your help.
>>
>> On Friday, January 12, 2024 at 10:42:05 AM UTC-5 g...@hobbelt.com wrote:
>>
>>> On Fri, 12 Jan 2024, 14:08 Oliver Saintilien, 
>>> wrote:
>>>
>>>> Something else I tried was this:
>>>> const tesseract = require("node-tesseract-ocr")
>>>>
>>>> tesseract
>>>>   .recognize(`C:\\Users\\osain\\OneDrive\\Desktop\\1992 Spring\\Document_20240109_0014.jpg`, {
>>>>     lang: "eng",
>>>>     oem: 1,
>>>>     psm: 0,
>>>>     "tessdata-dir": "C:\\Program Files\\Tesseract-OCR\\tessdata"
>>>>   })
>>>>
>>>> That's when I get the error about the Tessdata env var. I have pasted it
>>>> below:
>>>>
>>>> Command failed: tesseract "C:\Users\osain\OneDrive\Desktop\1992
>>>> Spring\Document_20240109_0014.jpg" stdout -l eng --oem 1 --psm 3
>>>> --tessdata-dir C:\Program Files\Tesseract-OCR\tessdata
>>>> Error opening data file C:\Program/eng.traineddata
>>>> Please make sure the TESSDATA_PREFIX environment variable is set to
>>>> your "tessdata" directory.
>>>>
>>>
>>> Adding to Zdenko's answer: what you need to do is fix / patch
>>> node-tesseract-ocr (or file a bug report there and see if someone else does
>>> it for you; since this is open source I suggest fork+fix+pullreq at
>>> node-tesseract-ocr instead ;-) ) so that it correctly converts paths
>>> with spaces, as specified in the js config struct, into correctly escaped,
>>> operating-system-dependent commandline arguments for the tesseract
>>> executable that node-tesseract-ocr invokes.
>>> The quickest fix would be to wrap the --tessdata-dir path argument in double
>>> quotes, which fixes most/your path issues on MS Windows (as long as the path
>>> itself is not adversarial, containing a dquote of its own).
>>>
>>> In other words: currently node-tesseract-ocr produces this commandline,
>>> as reported by you:
>>>
>>> tesseract "C:\Users\osain\OneDrive\Desktop\1992
>>> Spring\Document_20240109_0014.jpg" stdout -l eng --oem 1 --psm 3
>>> --tessdata-dir C:\Program Files\Tesseract-OCR\tessdata
>>>
>>> which is interpreted like this (extra newlines added to show the
>>> arguments separated):
>>>
>>> tesseract
>>>  "C:\Users\osain\OneDrive\Desktop\1992 Spring\Document_20240109_0014.jpg"
>>>  stdout
>>>  -l eng
>>>  --oem 1
>>>  --psm 3
>>>  --tessdata-dir C:\Program
>>> Files\Tesseract-OCR\tessdata
>>>
>>> so tesseract receives this and gets a damaged path PLUS a surplus
>>> argument it apparently ignored: "Files\Tesseract-OCR\tessdata".
>>>
>>> What SHOULD have been generated by node-tesseract-ocr is this (with
>>> extra newlines again):
>>>
>>>
>>> tesseract
>>>  "C:\Users\osain\OneDrive\Desktop\1992 Spring\Document_20240109_0014.jpg"
>>>  stdout
>>>  -l eng
>>>  --oem 1
>>>  --psm 3
>>>  --tessdata-dir "C:\Program Files\Tesseract-OCR\tessdata"
>>>
>>> as was intended in the js code.
>>>
>>>
>>> HTH,
>>>
>>> Ger
>>>
>>>

Re: [tesseract-ocr] Unable to build on macOS Mojave (10.14)

2024-01-12 Thread Zdenko Podobny

Does ffmpeg need the tesseract training tools as a dependency?
I guess something is misconfigured. On most unix-like systems the training
tools are packaged separately and not installed by default.
Can you avoid the 'make training' step?

I also wonder why the tesseract build did not stop during the
configuration step, as there should be tests for C++17...

Personally I do not think you need the training tools (at least I would not
suggest doing tesseract training on such an old system).

Zdenko


On Fri, 12 Jan 2024 at 14:29, Benoît Mars wrote:

> Yes, the future looks complicated for this version of macOS, but I do not
> really have the choice… Now the error is with the line #69:
>
> src/training/unicharset_extractor.cpp:69:58: error: 'u8string' is
> unavailable: introduced in macOS 10.15
>
> It prevents me from installing ffmpeg which is really painful…
> On Thursday, 11 January 2024 at 21:10:54 UTC+1, zdenop wrote:
>
>> std::filesystem::path[1] is part of the C++17 standard and Tesseract
>> has required this standard for a long time (4-5 years)[2]. So you suggest
>> reverting this decision?
>>
>>
>> [1] https://en.cppreference.com/w/cpp/filesystem/path
>> [2]
>> https://github.com/search?q=repo%3Atesseract-ocr%2Ftesseract+C%2B%2B17&type=code
>>
>>
>> Zdenko
>>
>>
>> On Thu, 11 Jan 2024 at 20:05, Tom Morris wrote:
>>
>>> It looks like unicharset_extractor.cpp is the only place that
>>> std::filesystem::path is used and that was introduced relatively recently:
>>>
>>> https://github.com/tesseract-ocr/tesseract/commit/8a26329623a017277364bc5670266c7e964c8a07?diff=split&w=1
>>> so it probably wouldn't be hard to restore compatibility with older
>>> systems, but you're likely living on borrowed time.
>>>
>>> If you want to try fixing this locally, you could replace
>>> path.extension() with string.compare()
>>>
>>> Tom
>>>

Re: [tesseract-ocr] Re: Unable to get Orientation with node-tesseract, Warning, detects only orientation with -l eng Error, OSD requires a model for the legacy engine

2024-01-12 Thread Zdenko Podobny

*tesseract executable problem:*

for TESSDATA_PREFIX you use a path with a space and you did not escape it
properly. That is why you get an error about a file that does not exist
("C:\Program/eng.traineddata").
Solutions:
a) use a path without special characters such as spaces
b) learn how to properly escape paths in environment variables
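
One way to sidestep the escaping problem from Node, sketched under the assumption that a `tesseract` executable is on PATH: pass the arguments as an array to `child_process.execFile`, so that each element reaches tesseract as a single argument. The function names here are hypothetical and this is not how node-tesseract-ocr is implemented.

```javascript
const { execFile } = require("child_process");

// Build the argument list as an array. Each element reaches tesseract as
// one argument, so a path containing spaces needs no manual quoting.
function tesseractArgs(imagePath, tessdataDir) {
  return [imagePath, "stdout", "-l", "eng", "--tessdata-dir", tessdataDir];
}

// Hypothetical wrapper; assumes a "tesseract" executable is on PATH.
function runTesseract(imagePath, tessdataDir) {
  return new Promise((resolve, reject) => {
    execFile("tesseract", tesseractArgs(imagePath, tessdataDir), (error, stdout) => {
      if (error) reject(error);
      else resolve(stdout);
    });
  });
}

// If a single command string must be built instead, wrap such paths in
// double quotes (assumes the path itself contains no double quote).
function quoteIfNeeded(arg) {
  return /\s/.test(arg) ? `"${arg}"` : arg;
}

console.log(quoteIfNeeded("C:\\Program Files\\Tesseract-OCR\\tessdata"));
// "C:\Program Files\Tesseract-OCR\tessdata"
```

The array form is usually the safer choice because no shell has to re-parse the command line; `quoteIfNeeded` only covers the simple case of whitespace in a path.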

When you solve this problem you will face the same problem (Error, OSD
requires a model for the legacy engine) as with node-tesseract-ocr (which
seems to handle paths correctly) ;-)
I guess the problem is that OSD needs the legacy engine while you restrict
tesseract to the LSTM engine only. So you need to change your options to allow
use of the legacy engine. I am not sure whether OSD also needs an
eng.traineddata with legacy components, but you will see.
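
As a rough sketch of the option change, the arguments for an orientation-only run could be built like this. `--psm 0` and `--oem 0` are the values discussed in this thread; the function names and the sample text are assumptions for illustration, not a captured tesseract log.

```javascript
// Arguments for orientation and script detection (OSD):
// --psm 0 selects OSD only, --oem 0 selects the legacy engine.
function osdArgs(imagePath, lang = "eng") {
  return [imagePath, "stdout", "--psm", "0", "--oem", "0", "-l", lang];
}

// Parse "Key: value" lines into an object. The sample text below is an
// assumed illustration of such output, not a captured log.
function parseKeyValueLines(text) {
  const result = {};
  for (const line of text.split(/\r?\n/)) {
    const idx = line.indexOf(":");
    if (idx === -1) continue; // skip lines without a key/value separator
    result[line.slice(0, idx).trim()] = line.slice(idx + 1).trim();
  }
  return result;
}

const sample = "Orientation in degrees: 180\nRotate: 180";
console.log(parseKeyValueLines(sample)["Rotate"]);
// 180
```

Whether the installed eng.traineddata actually contains legacy components still has to be checked by running the command directly in a terminal first.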

KR,

Zdenko


On Fri, 12 Jan 2024 at 14:08, Oliver Saintilien wrote:

> Sorry for the confusion. When I do
>
> tesseract
>   .recognize(`C:\\Users\\osain\\OneDrive\\Desktop\\1992 Spring\\Document_20240109_0014.jpg`, {
>     lang: "eng",
>     oem: 1,
>     psm: 0,
>   })
>
> I get
> Command failed: tesseract "C:\Users\osain\OneDrive\Desktop\1992
> Spring\Document_20240109_0014.jpg" stdout -l eng --oem 1 --psm 0
> Warning, detects only orientation with -l eng
> Error, OSD requires a model for the legacy engine
>
> How do I fix this error? I am using it through this wrapper:
> node-tesseract-ocr (npm, npmjs.com). I hear you when you say make sure
> tesseract (outside of the wrapper) is providing the expected results. But
> that's the thing: when I set psm to 0 I expect to get back orientation
> data. However, when I set the psm to other numbers like 3 or 1 it returns
> the text from the image.
>
> Something else I tried was this:
> const tesseract = require("node-tesseract-ocr")
>
> tesseract
>   .recognize(`C:\\Users\\osain\\OneDrive\\Desktop\\1992 Spring\\Document_20240109_0014.jpg`, {
>     lang: "eng",
>     oem: 1,
>     psm: 0,
>     "tessdata-dir": "C:\\Program Files\\Tesseract-OCR\\tessdata"
>   })
>
> That's when I get the error about the Tessdata env var. I have pasted it
> below:
>
> Command failed: tesseract "C:\Users\osain\OneDrive\Desktop\1992
> Spring\Document_20240109_0014.jpg" stdout -l eng --oem 1 --psm 3
> --tessdata-dir C:\Program Files\Tesseract-OCR\tessdata
> Error opening data file C:\Program/eng.traineddata
> Please make sure the TESSDATA_PREFIX environment variable is set to your
> "tessdata" directory.
> Failed loading language 'eng'
> Tesseract couldn't load any languages!
> Could not initialize tesseract.
>
> On Friday, January 12, 2024 at 1:11:56 AM UTC-5 zdenop wrote:
>
>> Unfortunately you don't.
>>
>> Instead of showing irrelevant information, make sure tesseract (outside
>> of the wrapper) is providing the expected results.
>>
>> You are claiming "I keep getting an error that I have to set the
>> TESSDATA_PREFIX" but your only relevant screenshot (which you made hardly
>> readable) shows that this is not true.
>> Please do not post screenshots - send the relevant logs (text) or copy the
>> text from the console.
>>
>> Zdenko
>>
>>
>> On Fri, 12 Jan 2024 at 4:59, Oliver Saintilien wrote:
>>
>>>
>>> When I do
>>> ```js
>>> tesseract
>>>   .recognize(`C:\\Users\\osain\\OneDrive\\Desktop\\1992 Spring\\Document_20240109_0014.jpg`, {
>>>     lang: "eng",
>>>     oem: 1,
>>>     psm: 0,
>>>   })
>>>   .then((text) => {
>>>     console.log(text)
>>>   })
>>> ```
>>> I was expecting to get some orientation info on the image, such as whether
>>> it's sideways, upside down, etc., but instead it gives me the error you see
>>> in my subject and in the screenshot. Changing the psm to 3 extracts the text
>>> perfectly, but when I change it to 0 I get that error. I got the number code
>>> for psm from here: Improving the quality of the output | tessdoc
>>> (tesseract-ocr.github.io)
>>>
>>> On Thursday, January 11, 2024 at 1:25:53 PM UTC-5 Oliver Saintilien
>>> wrote:
>>>
>>>> So I keep getting an error that I have to set the TESSDATA_PREFIX env
>>>> var, which I did, both in the User Vars and System Vars. However, after
>>>> doing that I get another error. I attached screenshots to make my setup and
>>>> issue as clear as possible. I'm using node-tesseract-ocr (npm, npmjs.com).
>>>>
>>>> [image: Screenshot 2024-01-11 131619.png][image: Screenshot 2024-01-11
>>>> 131802.png]
>>>>
>>>> [image: Screenshot 2024-01-11 131330.png]
>>>

Re: [tesseract-ocr] Re: Unable to get Orientation with node-tesseract, Warning, detects only orientation with -l eng Error, OSD requires a model for the legacy engine

2024-01-11 Thread Zdenko Podobny

Unfortunately you don't.

Instead of showing irrelevant information, make sure tesseract (outside of
the wrapper) is providing the expected results.

You are claiming "I keep getting an error that I have to set the
TESSDATA_PREFIX" but your only relevant screenshot (which you made hardly
readable) shows that this is not true.
Please do not post screenshots - send the relevant logs (text) or copy the
text from the console.

Zdenko


On Fri, 12 Jan 2024 at 4:59, Oliver Saintilien wrote:

>
> When I do
> ```js
> tesseract
>   .recognize(`C:\\Users\\osain\\OneDrive\\Desktop\\1992 Spring\\Document_20240109_0014.jpg`, {
>     lang: "eng",
>     oem: 1,
>     psm: 0,
>   })
>   .then((text) => {
>     console.log(text)
>   })
> ```
> I was expecting to get some orientation info on the image, such as whether
> it's sideways, upside down, etc., but instead it gives me the error you see
> in my subject and in the screenshot. Changing the psm to 3 extracts the text
> perfectly, but when I change it to 0 I get that error. I got the number code
> for psm from here: Improving the quality of the output | tessdoc
> (tesseract-ocr.github.io)
>
> On Thursday, January 11, 2024 at 1:25:53 PM UTC-5 Oliver Saintilien wrote:
>
>> So I keep getting an error that I have to set the TESSDATA_PREFIX env var,
>> which I did, both in the User Vars and System Vars. However, after doing
>> that I get another error. I attached screenshots to make my setup and
>> issue as clear as possible. I'm using node-tesseract-ocr (npm, npmjs.com).
>>
>> [image: Screenshot 2024-01-11 131619.png][image: Screenshot 2024-01-11
>> 131802.png]
>>
>> [image: Screenshot 2024-01-11 131330.png]
>

Re: [tesseract-ocr] Unable to build on macOS Mojave (10.14)

2024-01-11 Thread Zdenko Podobny

std::filesystem::path[1] is part of the C++17 standard and Tesseract
has required this standard for a long time (4-5 years)[2]. So you suggest
reverting this decision?


[1] https://en.cppreference.com/w/cpp/filesystem/path
[2]
https://github.com/search?q=repo%3Atesseract-ocr%2Ftesseract+C%2B%2B17&type=code


Zdenko


On Thu, 11 Jan 2024 at 20:05, Tom Morris wrote:

> It looks like unicharset_extractor.cpp is the only place that
> std::filesystem::path is used and that was introduced relatively recently:
>
> https://github.com/tesseract-ocr/tesseract/commit/8a26329623a017277364bc5670266c7e964c8a07?diff=split&w=1
> so it probably wouldn't be hard to restore compatibility with older
> systems, but you're likely living on borrowed time.
>
> If you want to try fixing this locally, you could replace path.extension()
> with string.compare()
>
> Tom
>

Re: [tesseract-ocr] Using tesseract in node

2024-01-10 Thread Zdenko Podobny

Tesseract is not trained for handwritten text.

Zdenko


On Wed, 10 Jan 2024 at 7:02, Sandeep Shakya wrote:

> import tesseract from "node-tesseract-ocr";
> import fs from "fs";
>
> const img = fs.readFileSync("./src/extract_user_input/2.jpg");
>
> const config = {
>   lang: "eng",
>   // oem: 1,
>   psm: 4,
> };
>
> tesseract
>   .recognize(img, config)
>   .then((text) => {
>     console.log("Result:", text);
>   })
>   .catch((error) => {
>     console.log("err");
>     console.log(error);
>   });
>
> I want to extract the handwritten text (e.g. point no. 24)
> from the image below. Which config should I use (or are any other
> changes to the image needed)?
>
>

Re: [tesseract-ocr] Unable to build on macOS Mojave (10.14)

2024-01-10 Thread Zdenko Podobny

... I am trying to upgrade tesseract from 5.2.0 to 5.3.3 on macOS 10.14.6
... is unavailable: introduced in macOS 10.15

Upgrade?

Zdenko


On Wed, 10 Jan 2024 at 17:29, Benoît Mars wrote:

> I am trying to upgrade tesseract from 5.2.0 to 5.3.3 on macOS 10.14.6 via
> Homebrew (v 4.2.3). The build fails during the "make training" process with
> the following errors:
>
> Last 15 lines from
> /Users/benoitmars/Library/Logs/Homebrew/tesseract/03.make:
> src/training/unicharset_extractor.cpp:74:33: error: '~path' is
> unavailable: introduced in macOS 10.15
> if (filePath.extension() == ".box") {
> ^
> /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/filesystem:791:3:
> note: '~path' has been explicitly marked unavailable here
>   ~path() = default;
>   ^
> src/training/unicharset_extractor.cpp:74:30: error: 'operator==' is
> unavailable: introduced in macOS 10.15
> if (filePath.extension() == ".box") {
>  ^
> /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/filesystem:1156:41:
> note: 'operator==' has been explicitly marked unavailable here
>   friend _LIBCPP_INLINE_VISIBILITY bool operator==(const path& __lhs,
> const path& __rhs) noexcept {
> ^
> 9 errors generated.
>
> Any idea to fix this? Thanks!
>

Re: [tesseract-ocr] Errors With Downloading Tesseract v4.1.1

2024-01-08 Thread Zdenko Podobny

Please provide the full log of the whole process (starting from autogen.sh).

Zdenko


On Tue, 9 Jan 2024 at 6:50, Evaan Ahmed wrote:

> Hey y'all!
>
> On my local machine (a Mac), I'm trying to download the version of
> Tesseract that is available on Google Colab. This is version 4.1.1. I
> downloaded the files from
> https://github.com/tesseract-ocr/tesseract/releases/tag/4.1.1 and tried
> to run the following commands:
> cd tesseract
> ./autogen.sh
> ./configure
> make
> sudo make install
> sudo ldconfig
> (Instructions copied from
> https://github.com/tesseract-ocr/tessdoc/blob/main/Compiling-%E2%80%93-GitInstallation.md
> )
>
> At the "make" stage, however, I see a lot of errors. First, there was this
> error 20 times: "In file included from
> /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include/c++/v1/cstddef:42:
> ../../version:1:1: error: expected unqualified-id
> 4.1.1".
> ChatGPT recommended removing the version file; I did that and this error
> went away.
> But then new errors started coming up (most of them around "member access
> into incomplete type 'Pixa' and 'Box' ").
>
> I was expecting the installation of a popular software to not have these
> faults. Am I doing something wrong?
>
> If we can't fix the bugs, can someone email me their tesseract 4.1.1
> executable?
>
> Thank You.
>
> Best,
> Evaan.
>

Re: [tesseract-ocr] Phantom characters

2024-01-01 Thread Zdenko Podobny

Please post:

   1. The original image (without preprocessing)
   2. The image used for OCR (preprocessed)
   3. The output from the tesseract executable (not from a tesseract wrapper)
   and the parameters/options used

Otherwise, nobody can reproduce the problem and therefore suggest a
solution.

Zdenko


On Sun, 31 Dec 2023 at 10:53, Jason Shepherd wrote:

> I'm using pytesseract and tesseract v5.3.3 to read text from some
> images and I sometimes get these weird phantom characters. I've tried
> some image preprocessing like increasing the image size, erosion,
> thresholding, etc., but nothing seems to get rid of this random character
> that's spawning from nothing. Attached are two image examples (left side
> is processed, right is the original with rect bounding boxes drawn). The
> blue rectangle to the right of "KB PNG" is a '_' being detected even though
> that space is completely blank. Any ideas on getting rid of this?
>

Re: [tesseract-ocr] Failed to load list of training filenames from data/foo/list.train

2024-01-01 Thread Zdenko Podobny

Follow https://github.com/tesseract-ocr/tesstrain/blob/main/README.md
Tesseract OCR 3.05.02 was released 6 years ago...

Zdenko


On Sat, 30 Dec 2023 at 18:24, Omar Samir wrote:

> I was trying to train Tesseract-OCR on the ocrd-testset.zip in the README,
> and I get this error above in the subject
>

Re: [tesseract-ocr] tessract usage

2024-01-01 Thread Zdenko Podobny

Did you check the license?

https://github.com/tesseract-ocr/tesseract/blob/main/LICENSE

Zdenko


On Wed, 27 Dec 2023 at 17:56, Ajay Bhosle wrote:

> Can I use tesseract to extract text from PDFs for commercial use?
>

Re: [tesseract-ocr] inaccuracy in plane text

2023-12-25 Thread Zdenko Podobny

I put it into the documentation because I had the same problem as you
(finding it) :-)

Zdenko


On Mon, 25 Dec 2023 at 4:40, Ger Hobbelt wrote:

>
>
> On Sat, 23 Dec 2023, 19:16 Zdenko Podobny,  wrote:
>
>>  tesseract expects black text (lettering) on a white background: that's
>> what it has been trained on and that's what will work best. Hence: try to
>> convert anything to look like that before feeding it to Tesseract.
>>
>>
>> This is not needed (in all cases ;-) ): tesseract inverts an image by
>> itself
>> <https://github.com/tesseract-ocr/tesseract/blob/ea0b245c43ee850f1e571d469b313b90d58d8b13/src/lstm/lstmrecognizer.cpp#L349-L378>
>>  for
>> LSTM and uses the OCR results with the best confidence. In practice it does
>> not work 100% of the time. But if somebody cares about speed, the best way
>> is to use a binarized image with a white background and black text, plus
>> the parameter tessedit_do_invert=0 (or the new parameter
>> <https://github.com/tesseract-ocr/tesseract/blob/ea0b245c43ee850f1e571d469b313b90d58d8b13/ChangeLog#L74-L75>
>>  invert_threshold=0.0)
>>
>
> Oh yes, absolutely, but I've seen images where the lstm "recognized"
> gobbledygook with a reported score /above/ 0.7 and thus skipping that
> "let's see what the inverted clip gives us" code chunk. While I'm usually
> fond of some extra detail like invert_threshold, there's way too many
> novices running into trouble who are probably better off not knowing about
> this option 😉 so they will put more effort into getting their images to
> look like white paper (background) with black print on it, before they feed
> it to tesseract and expect any kind of possibly decent result. Or so I
> hope.😅
>
>
>> (Someone did in depth research about this many years ago, published on
>> this list including charts, but i can't find the link within 60 seconds.
>> Lazy me, sorry)
>>
>>
>> "Willus Dotkom" - link is part of most ignored tesseract part
>> (documentation) - see
>> https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md#rescaling
>> :-)
>>
>
> Right on, bingo!
>
> 😰And I didn't check that page for it, while I did run a mailing list
> search. Whoops!🤦
>
> Seriously though: thanks for mentioning that link again. Very useful info
> that has been, many times over.
>
> Merry Christmas,
>
> Ger
>
>
>
>
>>
>> Zdenko
>>
>>
>> On Fri, 22 Dec 2023 at 19:51, Ger Hobbelt wrote:
>>
>>> Couple of things to check/test:
>>>
>>> - tesseract expects black text (lettering) on white background: that's
>>> what it has been trained on and that's what will work best. Hence: try to
>>> convert anything to look like that before feeding it to tesseract.
>>>
>>> - tesseract was trained on text, if I recall correctly, that's 11pt.
>>> Which is what you'll read in several places on the internet and is useless
>>> info as-is because pt (points) are a printer/publisher unit of measure for
>>> *paper* print, not computer images.
>>>  However, this translates to 30-50px total character height, including
>>> stem height for glyphs such as p,q,b and d, so the rule of thumb becomes:
>>> try to make your text line fit in 30 to 50 pixels height, for possibly best
>>> results. (Someone did in depth research about this many years ago,
>>> published on this list including charts, but i can't find the link within
>>> 60 seconds. Lazy me, sorry)
>>>
>>> - tesseract uses dictionary-like behaviour to help guesstimate what it is
>>> actually seeing (lstm can be argued to behave like a Markov chain, old
>>> skool v3 OCR mode uses dictionaries) and that means tesseract very much
>>> likes to see human language "words". Stuff like, if you just saw a q, and
>>> your language is any Indo-European one, you can bet your bottom the next
>>> glyph will be 'u'. As in: "QUestion".
>>>
>>> Yours, however, is a semi-random letter matrix for a puzzle, so you may
>>> want to look into ways to circumnavigate this tesseract behaviour because
>>> you are feeding it stuff that's outside the original training domain
>>> (books, publications, academic papers).
>>> One approach to try is to go and cut the image up into individual
>>> character images and feed each to tesseract individually; you MAY observe
>>> better overall OCR results then.
>>>
>>> Second, since lstm is fundamentally like a Markov cha

Re: [tesseract-ocr] inaccuracy in plane text

2023-12-23 Thread Zdenko Podobny

> tesseract expects black text (lettering) on a white background: that's
> what it has been trained on and that's what will work best. Hence: try to
> convert anything to look like that before feeding it to Tesseract.


This is not needed (in all cases ;-) ): tesseract inverts an image by itself
<https://github.com/tesseract-ocr/tesseract/blob/ea0b245c43ee850f1e571d469b313b90d58d8b13/src/lstm/lstmrecognizer.cpp#L349-L378>
for LSTM and uses the OCR results with the best confidence. In practice it
does not work 100% of the time. But if somebody cares about speed, the best
way is to use a binarized image with a white background and black text, plus
the parameter tessedit_do_invert=0 (or the new parameter
<https://github.com/tesseract-ocr/tesseract/blob/ea0b245c43ee850f1e571d469b313b90d58d8b13/ChangeLog#L74-L75>
invert_threshold=0.0)
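
A minimal sketch of passing such a parameter on the command line from Node, using tesseract's `-c name=value` option. The helper name `configVarArgs` and the file name `binarized.png` are made up for illustration; the input is assumed to be already binarized.

```javascript
// Turn a plain object of Tesseract config variables into repeated
// "-c name=value" command-line arguments.
function configVarArgs(vars) {
  const args = [];
  for (const [name, value] of Object.entries(vars)) {
    args.push("-c", `${name}=${value}`);
  }
  return args;
}

// Input is assumed to be already binarized: black text on white background.
const args = ["binarized.png", "stdout", "-l", "eng",
  ...configVarArgs({ tessedit_do_invert: 0 })];

console.log(args.join(" "));
// binarized.png stdout -l eng -c tessedit_do_invert=0
```

For invert_threshold, pass the value as a string ("0.0") if the exact formatting matters, since the number 0.0 is printed as 0 in JavaScript.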

> (Someone did in depth research about this many years ago, published on this
> list including charts, but I can't find the link within 60 seconds. Lazy
> me, sorry)


"Willus Dotkom" - the link is in the most ignored part of tesseract
(the documentation) - see
https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md#rescaling
:-)
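
The 30 to 50 pixel rule of thumb quoted in this thread can be turned into a scale factor with a few lines of arithmetic. This is only a sketch: the function name is hypothetical, and aiming for the middle of the range is an assumption, not something the documentation prescribes.

```javascript
// Rule of thumb from this thread: aim for a text line height of 30-50 px.
// Returns the factor by which to scale the image; 1 means leave as is.
function scaleForLineHeight(measuredPx, minPx = 30, maxPx = 50) {
  if (measuredPx <= 0) throw new RangeError("line height must be positive");
  if (measuredPx >= minPx && measuredPx <= maxPx) return 1;
  const targetPx = (minPx + maxPx) / 2; // aim for the middle of the range
  return targetPx / measuredPx;
}

console.log(scaleForLineHeight(16)); // 2.5
console.log(scaleForLineHeight(40)); // 1
```

The measured line height has to come from the image itself, e.g. from the bounding boxes of a first OCR pass or from manual inspection.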


Zdenko


On Fri, 22 Dec 2023 at 19:51, Ger Hobbelt wrote:

> Couple of things to check/test:
>
> - tesseract expects black text (lettering) on white background: that's
> what it has been trained on and that's what will work best. Hence: try to
> convert anything to look like that before feeding it to tesseract.
>
> - tesseract was trained on text, if I recall correctly, that's 11pt. Which
> is what you'll read in several places on the internet and is useless info
> as-is because pt (points) are a printer/publisher unit of measure for
> *paper* print, not computer images.
>  However, this translates to 30-50px total character height, including
> stem height for glyphs such as p,q,b and d, so the rule of thumb becomes:
> try to make your text line fit in 30 to 50 pixels height, for possibly best
> results. (Someone did in depth research about this many years ago,
> published on this list including charts, but i can't find the link within
> 60 seconds. Lazy me, sorry)
>
> - tesseract uses dictionary-like behaviour to help guesstimate what it is
> actually seeing (lstm can be argued to behave like a Markov chain, old
> skool v3 OCR mode uses dictionaries) and that means tesseract very much
> likes to see human language "words". Stuff like, if you just saw a q, and
> your language is any Indo-European one, you can bet your bottom dollar the
> next glyph will be 'u'. As in: "QUestion".
>
> Yours, however, is a semi-random letter matrix for a puzzle, so you may
> want to look into ways to circumnavigate this tesseract behaviour because
> you are feeding it stuff that's outside the original training domain
> (books, publications, academic papers).
> One approach to try is to go and cut the image up into individual
> character images and feed each to tesseract individually; you MAY observe
> better overall OCR results then.
>
> Second, since lstm is fundamentally like a Markov chain (rather: core has
> Markov like behavioral aspects) and is NOT engineered for single glyph
> recognition, you may also want to see how classic tesseract V3 OCR modes
> are doing with your letter matrices as the older V3 engine is single-shape
> based and thus *potentially* more suitable for use against semi-random,
> independent, single character inputs like yours.
>
> My 2 cents. HTH
>
>
>
> On Wed, 20 Dec 2023, 13:33 Mishal Shanavas, 
> wrote:
>
>> i can not extract text with reliable accuracy of a simple text
>>
>> [image: crop.png]
>>
>>
>>
>> check it out
>>
>> https://colab.research.google.com/drive/11utvWD3s6DqqGZQEnk5cKIAj46ZLsF5y?usp=sharing
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to tesseract-ocr+unsubscr...@googlegroups.com.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/f86e2d35-4c35-4643-835f-109994e46632n%40googlegroups.com
>> 
>> .
>>

Re: [tesseract-ocr] Font Not Found Error

2023-12-20 Thread Zdenko Podobny

1. tesseract 4 is outdated.
2. tesstrain.sh is deprecated.


Zdenko


On Wed, 20 Dec 2023 at 11:18, Uvindu Bimsara wrote:

> When i started training tesseract 4.0 using tesstrain.sh for sinhala
> unicode font got this error.
> === Starting training for language 'sin' [Wed Dec 20 09:44:58 AM UTC 2023]
> /usr/bin/text2image --fonts_dir=fonts --ptsize 12 --font=SS-SuLakna
> --outputbase=/tmp/font_tmp.MoFCLmddzb/sample_text.txt
> --text=/tmp/font_tmp.MoFCLmddzb/sample_text.txt
> --fontconfig_tmpdir=/tmp/font_tmp.MoFCLmddzb Could not find font named
> 'SS-SuLakna'. Pango suggested font 'Bhashitha Bold'. Please correct --font
> arg. ERROR: Program text2image failed. Abort.
>
> Here is my code
> !rm -rf train/*
> ! /content/drive/MyDrive/nic_project/HNR/tesseract/src/training/tesstrain.sh
> --fonts_dir fonts \
>   --fontlist "SS-SuLakna" \
>   --lang sin \
>   --linedata_only \
>   --langdata_dir /content/drive/MyDrive/nic_project/HNR/langdata_lstm \
>   --tessdata_dir /content/drive/MyDrive/nic_project/HNR/tesseract/tessdata
> \
>   --save_box_tiff \
>   --maxpages 10 \
>   --output_dir train
>

Re: [tesseract-ocr] Numbers detection

2023-12-19 Thread Zdenko Podobny

Hello,

For Tesseract you need to remove all non-text parts (graphic elements). IMO
the outlined number would also be problematic.

It would be better to post the original image so people can play with
preprocessing...
See e.g. this discussion
https://groups.google.com/g/tesseract-ocr/c/YqW9XhbWC_8/m/75juLKoJDwAJ (not
sure if this is possible with JavaScript)

Zdenko


On Tue, 19 Dec 2023 at 17:08, Aftab wrote:

> Hey guys,
> I am very new to image processing & OCR. But after a lot of trial and
> error.
> I have reached to this point. I have a small image, cropped from larger
> input and the image is pre-processed to maximise the visibility of the
> number.
>
> It is able to detect 1 at the top, but it is not able to detect the
> number on the bottom.
> Here is the processed image I am working with.
> .[image: processed image.jpeg]
>
> I am running this in Browser using the tesseract.js node module, and here
> is my code for the detection: Tried with default pageseg_mode, as well as
> various other modes. 11 worked best out of all.
>
> async function recognizeText(image) {
> const worker = await createWorker('eng');
> await worker.setParameters({
> tessedit_char_whitelist: '0123456789',
> tessedit_pageseg_mode: '11',
> });
> const ret = await worker.recognize(image);
> console.log(ret.data.text);
> await worker.terminate();
> }
>

Re: [tesseract-ocr] getComponnentImages falling short of a few words/ characters

2023-12-17 Thread Zdenko Podobny

First of all, provide the original input image.
Next, it would be nice to see code to replicate the problem.

Zdenko


On Sun, 17 Dec 2023 at 8:04, 'Muhammad Ali' via tesseract-ocr <
tesseract-ocr@googlegroups.com> wrote:

> Hi team,
>
> I have had a few recurring issues with inaccurate getComponentImages ROI
> boxes: the ROIs come out smaller than the actual words (sample
> attachments provided). But I couldn't put a finger on what could be causing
> this.
>
> I am using tesserOCR wrapper with Tesseract 5.3 underlying.
>
> Another question about the same getComponentImages API in tesseract is,
> does tesseract have a pre-recognition text detector like EAST for example?
> or is getComponentImages the same thing as getting the text recognized, and
> just the output is different i-e instead of text values, it turns them into
> ROI boxes with coordinates?
>
>

Re: [tesseract-ocr] Fasten Tesseract OCR

2023-12-14 Thread Zdenko Podobny

A more effective approach to addressing the issue is to create a
test/example case. Advanced users can then evaluate it and potentially offer
solutions.

It would be helpful if you could provide details on how you obtain and
process the input images, as well as the OCR execution method (API,
wrapper, executable). Examining this could reveal opportunities for speed
improvements, particularly by minimizing IO operations.

It's worth noting that there have been reported problems with OpenMP on
Linux and Mac in the context of extensive OCR tasks, as outlined in these
GitHub issues: [1], [2].
Investigating these and other (performance-related) issues may offer
insights into potential optimizations.

[1]
https://github.com/tesseract-ocr/tesseract/issues/943#issuecomment-179813
[2] https://github.com/tesseract-ocr/tesseract/issues/3109
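
A rough sketch of one mitigation that often comes up in such OpenMP
discussions (it is not taken from this thread): limit every tesseract process
to a single OpenMP thread via the standard OMP_THREAD_LIMIT environment
variable and parallelize across images instead. The helper names, the worker
count and the choice of --psm 8 (single word, matching the word-sized ROIs
described below) are assumptions; whether it helps has to be measured.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def single_thread_env(base_env=None):
    """Copy the environment and cap OpenMP at one thread per process."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_THREAD_LIMIT"] = "1"
    return env


def ocr_command(image_path, psm=8):
    """tesseract call that prints the recognized text to stdout."""
    return ["tesseract", image_path, "-", "--psm", str(psm)]


def ocr_one(image_path):
    """Run one tesseract process and return its text output."""
    result = subprocess.run(
        ocr_command(image_path),
        env=single_thread_env(),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def ocr_many(image_paths, workers=4):
    """Process several images concurrently, one tesseract process each."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ocr_one, image_paths))
```

Timing ocr_many() on a fixed set of ROIs, with and without the thread limit,
would show whether the variance in processing time comes from OpenMP at all.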

Zdenko


On Thu, 30 Nov 2023 at 14:57, vadansh kulshreshtha <
vadansh.kulshresh...@einnosystech.com> wrote:

> I am using an i3 quad-core CPU. My scenario is that I want to process 100
> images in 1 sec including the image processing and cropping images. I
> create an ROI crop it and do the image processing then OCR. But what
> happens is that sometimes the same ROI takes more than 1 sec but sometimes
> it does it in 150-200ms. Also, I use the best train file of Tesseract.
> Also, the size of my ROI is not more than the size of a word. eg. "super@145
> &4califragilisticexpialidocious".
>
> For image processing, I do the thresholding, and zooming of images if
> required.
>
> Please do suggest to me the ways to get a reliable OCR processing time and
> also ways to fasten the OCR.
>
> Thank you
>
> On Wednesday, 29 November 2023 at 20:04:25 UTC+5:30 zdenop wrote:
>
>> Your request is too general e.g.  reply could be "upgrade your
>> hardware"... ;-)
>>
>> Unless you provide details about your testing environment + process of
>> measuring speed and testing images, there is just one general advice: read
>> the docs and issue tracker (including closed issues), there are several
>> discussions (and hints) regarding speed.
>>
>> Zdenko
>>
>>
>> On Wed, 29 Nov 2023 at 13:53, vadansh kulshreshtha <
>> vadansh.ku...@einnosystech.com> wrote:
>>
>>> Hello Everyone,
>>>
>>> I am using Tesseract OCR 5.2 and I want to speed up my OCR process so
>>> for that, could any help me with the same? It would be a great help for me.
>>> Also can anyone tell me all the parameters that affect the speed of OCR.
>>>
>>> Thank you
>>>

Re: [tesseract-ocr] Fasten Tesseract OCR

2023-11-29 Thread Zdenko Podobny

Your request is too general; e.g. the reply could be "upgrade your
hardware"... ;-)

Unless you provide details about your testing environment, your process of
measuring speed and your test images, there is just one piece of general
advice: read the docs and the issue tracker (including closed issues); there
are several discussions (and hints) regarding speed.

Zdenko


On Wed, 29 Nov 2023 at 13:53, vadansh kulshreshtha <
vadansh.kulshresh...@einnosystech.com> wrote:

> Hello Everyone,
>
> I am using Tesseract OCR 5.2 and I want to speed up my OCR process so for
> that, could any help me with the same? It would be a great help for me.
> Also can anyone tell me all the parameters that affect the speed of OCR.
>
> Thank you
>

Re: [tesseract-ocr] Tesseract on single digit detection

2023-11-27 Thread Zdenko Podobny

Crop images properly (without borders) and follow suggestions in docs:

>tesseract pic2_cropped_postprocessed.png - --psm 10
5
>tesseract pic4_cropped_postprocessed.png - --psm 10
7
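
A small sketch of the same recipe in Python, assuming the frame around the
digit is at most a known number of pixels thick and the image is held as
rows of grayscale values. trim_border() and single_char_command() are
made-up helper names; only the --psm 10 option comes from the commands above.

```python
def trim_border(gray_rows, border):
    """Drop `border` pixels from every side of the image."""
    height = len(gray_rows)
    width = len(gray_rows[0])
    if 2 * border >= height or 2 * border >= width:
        raise ValueError("border is too large for this image")
    return [row[border:width - border] for row in gray_rows[border:height - border]]


def single_char_command(image_path):
    """tesseract call that treats the image as a single character."""
    return ["tesseract", image_path, "-", "--psm", "10"]
```

A fixed-width trim is the simplest possible choice; a real frame detector
would be more robust but is beyond this sketch.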

Zdenko


On Mon, 27 Nov 2023 at 9:42, Fernando Benayas de los Santos <
ferbenaya...@gmail.com> wrote:

> Hi guys/gals!
> Long story short: I'm trying to use tesseract to extract a single digit
> from a small image (containing a single digit) but I can't get over ~50%
> accuracy.
>
> I have attached some examples of images. It should be pretty easy to
> extract the digit.
>
> So far, the best approach consists in using --psm 13 and zoom a bit to get
> the frame out. Tried to change the image to black/white, but it didn't
> solve much :(
>
> Any ideas? (I'm playing around with obscure -c options but they don't seem
> to have any effect)
>

Re: [tesseract-ocr] Failed loading language 'eng'

2023-11-25 Thread Zdenko Podobny

Tesseract 3.x is unsupported.

I am not a Java developer, but according to
https://github.com/nguyenq/tess4j/releases tess4j-5.8.0 should
support Tesseract 5.3.2, so I would start from that.
If there is still a problem, have a look at their wiki (
https://github.com/nguyenq/tess4j/wiki) and issue tracker.


Zdenko


On Sat, 25 Nov 2023 at 17:48, 'sanogo sy' via tesseract-ocr <
tesseract-ocr@googlegroups.com> wrote:

> Too stupid, my bad!
> Could someone give me some advice to install required version.
> I use tess4j 5.4.0.jar in my application. In local on windows OS, I tried
> another version of tess4j but it didn't work, so I kept tess4j 5.4.0.
> Now I had to make it run in linux Centos 7.
>
> I tried many documentation like:
> https://gist.github.com/lorne-luo/ddfdbf3655e068669ba27d80060cabf8
>
> https://stackoverflow.com/questions/23792373/installing-tesseract-ocr-on-centos-6
>
> I also tried like that:
>
> wget http://www.leptonica.org/source/leptonica-1.79.0.tar.gz
> wget https://github.com/tesseract-ocr/tesseract/archive/5.3.0.tar.gz
>
> Configure, compile, install libs:
>
> tar xzvf leptonica-1.79.0.tar.gz
> cd leptonica-1.79.0
> ./configure
> make
> make install
>
> cd ..
>
>
> tar xzf 5.3.0.tar.gz
> cd tesseract-5.3.0
> ./autogen.sh
> ./configure
> make
> sudo make install
> sudo ldconfig
>
>
> I tried also that way:
>
> wget http://www.leptonica.org/source/leptonica-1.69.tar.gz
> wget
> https://tesseract-ocr.googlecode.com/files/tesseract-ocr-3.02.02.tar.gz
>
> tar xzvf leptonica-1.69.tar.gz
> cd leptonica-1.69
> ./configure
> make
> sudo make install
>
> tar xzf tesseract-ocr-3.02.02.tar.gz
> cd tesseract-3.01
> ./autogen.sh
> ./configure
> make
> sudo make install
> sudo ldconfig
>
> wget
> http://tesseract-ocr.googlecode.com/files/tesseract-ocr-3.02.eng.tar.gz
>
> tar xzf tesseract-ocr-3.02.eng.tar.gz
> sudo cp tesseract-ocr/tessdata/* /usr/local/share/tessdata
>
> But I get error like could not initialized tess4j error.
> So, I need help to install right version for making work in linux OS
> centos 7, with java 8 and tess4j 5.4.0. My application is running on a
> wildfly server version 24.
>
> Thank's in advance!
>
> On Saturday, November 25, 2023 at 4:30:46 PM UTC zdenop wrote:
>
>> you used an old unsupported version of your tools (not sure if the
>> problem is in the used/installed wrapper or Tesseract library...)  - the
>> cube engine was removed from Tesseract several years ago...
>>
>>
>> Zdenko
>>
>>
>> On Sat, 25 Nov 2023 at 15:31, 'sanogo sy' via tesseract-ocr <
>> tesser...@googlegroups.com> wrote:
>>
>>> But in my app that running in server wildfly 24, I got error say: Failed
>>> loading language 'eng'.
>>> In my log file I got:
>>>
>>> Failed loading language 'eng'
>>> Cube ERROR (CubeRecoContext::Load): unable to read cube language model
>>> params from /tmp/tess4j/tessdata/fra.cube.lm
>>> Cube ERROR (CubeRecoContext::Create): unable to init CubeRecoContext
>>> object
>>> init_cube_objects(false, &tessdata_manager):Error:Assert failed:in file
>>> tessedit.cpp, line 210
>>> #
>>> # A fatal error has been detected by the Java Runtime Environment:
>>> #
>>> #  SIGSEGV (0xb) at pc=0x7fb7e88ac249, pid=56208,
>>> tid=0x7fb7ed342700
>>> #
>>> # JRE version: OpenJDK Runtime Environment (8.0_131-b12) (build
>>> 1.8.0_131-b12)
>>> # Java VM: OpenJDK 64-Bit Server VM (25.131-b12 mixed mode linux-amd64
>>> compressed oops)
>>> # Problematic frame:
>>> # C  [libtesseract.so+0x239249]  ERRCODE::error(char const*,
>>> TessErrorLogCode, char const*, ...) const+0x129
>>>
>>>
>>> On Saturday, November 25, 2023 at 1:25:39 PM UTC sanogo sy wrote:
>>>
>>>> If I well understood, you mean by tesseract (executable) to run
>>>> tesseract command on purpose to check how it works.
>>>> I just run command: tesseract  path_of_my_image.jpg  output.txt
>>>> My output file is empty. It seems that it doesn't work because I got in
>>>> my command line message :
>>>>
>>>> Estimating resolution as 181
>>>> Error in boxClipToRectangle: box outside rectangle
>>>> Error in pixScanForForeground: invalide box
>>>>
>>>> On Saturday, November 25, 2023 at 1:09:33 PM UTC zdenop wrote:
>>>>
>>>>> And the result is?
>>>>>
>>>>>
>>>>> Zdenko
>>>>>
>>>>>
>>>>> On Sat, 25 Nov 2023 at 13:07, 'sanogo sy' via tesseract-ocr <
>>>>> tesser...@googlegroups.com> wrote:
>>>>>
>>>>>> I forgot to mentione that I use Centos 7.
>>>>>> I tried that command : tesseract img.jpg out
>>>>>>
>>>>>> As result I got a message like:
>>>>>>
>>>>>> Estimating resolution as 181
>>>>>> Error in boxClipToRectangle: box outside rectangle
>>>>>> Error in pixScanForForeground: invalide box
>>>>>>
>>>>>> On Saturday, November 25, 2023 at 10:31:49 AM UTC zdenop wrote:
>>>>>>
>>>>>>> Does tesseract (executable) has the same problem?
>>>>>>> If yes, that check the
>>>>>>> content of /usr/share/tesseract-ocr/4/tessdata/
>>>>>>> If not follow code of tesseract executable.
>>>>>>>
>>>>>>>
>>>>>>> Zdenko
>>>>>>>
>>>>>>>
>>>>>>> On Sat, 25 Nov 2023 at 11:07, 'sanogo sy' v

Re: [tesseract-ocr] Failed loading language 'eng'

2023-11-25 Thread Zdenko Podobny

You used an old, unsupported version of your tools (not sure if the problem
is in the used/installed wrapper or in the Tesseract library...) - the cube
engine was removed from Tesseract several years ago...


Zdenko


On Sat, 25 Nov 2023 at 15:31, 'sanogo sy' via tesseract-ocr <
tesseract-ocr@googlegroups.com> wrote:

> But in my app that running in server wildfly 24, I got error say: Failed
> loading language 'eng'.
> In my log file I got:
>
> Failed loading language 'eng'
> Cube ERROR (CubeRecoContext::Load): unable to read cube language model
> params from /tmp/tess4j/tessdata/fra.cube.lm
> Cube ERROR (CubeRecoContext::Create): unable to init CubeRecoContext object
> init_cube_objects(false, &tessdata_manager):Error:Assert failed:in file
> tessedit.cpp, line 210
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7fb7e88ac249, pid=56208,
> tid=0x7fb7ed342700
> #
> # JRE version: OpenJDK Runtime Environment (8.0_131-b12) (build
> 1.8.0_131-b12)
> # Java VM: OpenJDK 64-Bit Server VM (25.131-b12 mixed mode linux-amd64
> compressed oops)
> # Problematic frame:
> # C  [libtesseract.so+0x239249]  ERRCODE::error(char const*,
> TessErrorLogCode, char const*, ...) const+0x129
>
>
> On Saturday, November 25, 2023 at 1:25:39 PM UTC sanogo sy wrote:
>
>> If I well understood, you mean by tesseract (executable) to run tesseract
>> command on purpose to check how it works.
>> I just run command: tesseract  path_of_my_image.jpg  output.txt
>> My output file is empty. It seems that it doesn't work because I got in
>> my command line message :
>>
>> Estimating resolution as 181
>> Error in boxClipToRectangle: box outside rectangle
>> Error in pixScanForForeground: invalide box
>>
>> On Saturday, November 25, 2023 at 1:09:33 PM UTC zdenop wrote:
>>
>>> And the result is?
>>>
>>>
>>> Zdenko
>>>
>>>
>>> On Sat, 25 Nov 2023 at 13:07, 'sanogo sy' via tesseract-ocr <
>>> tesser...@googlegroups.com> wrote:
>>>
>>>> I forgot to mentione that I use Centos 7.
>>>> I tried that command : tesseract img.jpg out
>>>>
>>>> As result I got a message like:
>>>>
>>>> Estimating resolution as 181
>>>> Error in boxClipToRectangle: box outside rectangle
>>>> Error in pixScanForForeground: invalide box
>>>>
>>>> On Saturday, November 25, 2023 at 10:31:49 AM UTC zdenop wrote:
>>>>
>>>>> Does tesseract (executable) has the same problem?
>>>>> If yes, that check the content of /usr/share/tesseract-ocr/4/tessdata/
>>>>> If not follow code of tesseract executable.
>>>>>
>>>>>
>>>>> Zdenko
>>>>>
>>>>>
>>>>> On Sat, 25 Nov 2023 at 11:07, 'sanogo sy' via tesseract-ocr <
>>>>> tesser...@googlegroups.com> wrote:
>>>>>
>>>>>> Hi every one. I got an error with tesseract. When I try to use it in
>>>>>> my app, I got an error like "Failed loading language eng".
>>>>>> I installed tesseract 5 with leptonica 1.79
>>>>>>
>>>>>> To solve the problem I tried  that command :
>>>>>> export TESSDATA_PREFIX=/usr/share/tesseract-ocr/4/tessdata/
>>>>>> I cloned from git repo tesseract tessdata:
>>>>>> https://github.com/tesseract-ocr/tessdata.git
>>>>>> Then I moved files in /usr/share/tesseract-ocr/4/tessdat/ folder.
>>>>>> But it still not working.
>>>>>>
>>>>>> I really need help, please. I've been working for 3 days.
>>>>>>

Re: [tesseract-ocr] Failed loading language 'eng'

2023-11-25 Thread Zdenko Podobny

And the result is?


Zdenko


On Sat, 25 Nov 2023 at 13:07, 'sanogo sy' via tesseract-ocr <
tesseract-ocr@googlegroups.com> wrote:

> I forgot to mentione that I use Centos 7.
> I tried that command : tesseract img.jpg out
>
> As result I got a message like:
>
> Estimating resolution as 181
> Error in boxClipToRectangle: box outside rectangle
> Error in pixScanForForeground: invalide box
>
> On Saturday, November 25, 2023 at 10:31:49 AM UTC zdenop wrote:
>
>> Does tesseract (executable) has the same problem?
>> If yes, that check the content of /usr/share/tesseract-ocr/4/tessdata/
>> If not follow code of tesseract executable.
>>
>>
>> Zdenko
>>
>>
>> On Sat, 25 Nov 2023 at 11:07, 'sanogo sy' via tesseract-ocr <
>> tesser...@googlegroups.com> wrote:
>>
>>> Hi every one. I got an error with tesseract. When I try to use it in my
>>> app, I got an error like "Failed loading language eng".
>>> I installed tesseract 5 with leptonica 1.79
>>>
>>> To solve the problem I tried  that command :
>>> export TESSDATA_PREFIX=/usr/share/tesseract-ocr/4/tessdata/
>>> I cloned from git repo tesseract tessdata:
>>> https://github.com/tesseract-ocr/tessdata.git
>>> Then I moved files in /usr/share/tesseract-ocr/4/tessdat/ folder.
>>> But it still not working.
>>>
>>> I really need help, please. I've been working for 3 days.
>>>

Re: [tesseract-ocr] Failed loading language 'eng'

2023-11-25 Thread Zdenko Podobny

Does the tesseract executable have the same problem?
If yes, then check the contents of /usr/share/tesseract-ocr/4/tessdata/
If not, follow the code of the tesseract executable.
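
The directory check can be scripted, for example like the sketch below. It
assumes the usual <lang>.traineddata file naming; find_traineddata() is a
made-up helper, not part of tesseract or tess4j.

```python
import os


def find_traineddata(lang, tessdata_dir):
    """Return the path of <lang>.traineddata if it exists and is non-empty."""
    path = os.path.join(tessdata_dir, lang + ".traineddata")
    if os.path.isfile(path) and os.path.getsize(path) > 0:
        return path
    return None


if __name__ == "__main__":
    # Directory taken from TESSDATA_PREFIX, falling back to the path above.
    tessdata = os.environ.get("TESSDATA_PREFIX", "/usr/share/tesseract-ocr/4/tessdata/")
    print(find_traineddata("eng", tessdata) or "eng.traineddata not found in " + tessdata)
```

A result of "not found" usually means the files were copied somewhere other
than the directory TESSDATA_PREFIX points to.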


Zdenko


On Sat, 25 Nov 2023 at 11:07, 'sanogo sy' via tesseract-ocr <
tesseract-ocr@googlegroups.com> wrote:

> Hi every one. I got an error with tesseract. When I try to use it in my
> app, I got an error like "Failed loading language eng".
> I installed tesseract 5 with leptonica 1.79
>
> To solve the problem I tried  that command :
> export TESSDATA_PREFIX=/usr/share/tesseract-ocr/4/tessdata/
> I cloned from git repo tesseract tessdata:
> https://github.com/tesseract-ocr/tessdata.git
> Then I moved files in /usr/share/tesseract-ocr/4/tessdat/ folder.
> But it still not working.
>
> I really need help, please. I've been working for 3 days.
>

Re: [tesseract-ocr] I am unable to train a new font to tesseract, I am getting a deserialize failed error

2023-11-23 Thread Zdenko Podobny

Please provide files for replicating the problem, otherwise

Zdenko
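
One thing that can be checked without the files is the training list itself.
In the tesstrain workflow list.train and list.eval list .lstmf files, while
the report below says .tif paths were put in. A hedged sketch of such a
check follows; check_list_file() is a hypothetical helper, and whether this
is the actual cause here is only a guess.

```python
import os


def check_list_file(list_path):
    """Return (entry, problem) pairs for suspicious lines in a training list."""
    with open(list_path, encoding="utf-8") as fh:
        entries = [line.strip() for line in fh if line.strip()]
    problems = []
    for entry in entries:
        if not entry.endswith(".lstmf"):
            problems.append((entry, "not an .lstmf file"))
        elif not os.path.isfile(entry):
            problems.append((entry, "missing"))
        elif os.path.getsize(entry) == 0:
            problems.append((entry, "empty"))
    return problems
```

An empty result only says the list looks plausible; it does not validate the
contents of the .lstmf files.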


On Thu, 23 Nov 2023 at 8:29, Adepu Sai Rahul wrote:

> the tif files are not corrupted and box files are not of size zero
>
>
> On Thursday, November 23, 2023 at 12:51:49 PM UTC+5:30 desal...@gmail.com
> wrote:
>
>> Make sure that the tif files are not corrupted; or the box files are not
>> zero size.
>>
>> Des
>>
>> On 23 Nov 2023 at 9:26:39 AM, Adepu Sai Rahul 
>> wrote:
>>
>>>
>>> chinnu@SaiRahul2507:~/tesseract_tutorial/tesstrain$
>>> TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=Y145
>>> START_MODEL=eng TESSDATA=../tesseract/tessdata MAX_ITERATIONS=200
>>> You are using make version: 4.3
>>> lstmtraining \
>>>   --debug_interval 0 \
>>>   --traineddata data/Y145/Y145.traineddata \
>>>   --old_traineddata ../tesseract/tessdata/eng.traineddata \
>>>   --continue_from data/eng/Y145.lstm \
>>>   --learning_rate 0.0001 \
>>>   --model_output data/Y145/checkpoints/Y145 \
>>>   --train_listfile data/Y145/list.train \
>>>   --eval_listfile data/Y145/list.eval \
>>>   --max_iterations 200 \
>>>   --target_error_rate 0.01
>>> Loaded file data/eng/Y145.lstm, unpacking...
>>> Warning: LSTMTrainer deserialized an LSTMRecognizer!
>>> Code range changed from 111 to 111!
>>> Num (Extended) outputs,weights in Series:
>>>   1,36,0,1:1, 0
>>> Num (Extended) outputs,weights in Series:
>>>   C3,3:9, 0
>>>   Ft16:16, 160
>>> Total weights = 160
>>>   [C3,3Ft16]:16, 160
>>>   Mp3,3:16, 0
>>>   TxyLfys64:64, 20736
>>>   Lfx96:96, 61824
>>>   RxLrx96:96, 74112
>>>   Lfx512:512, 1247232
>>>   Fc111:111, 56943
>>> Total weights = 1461007
>>> Previous null char=110 mapped to 110
>>> Continuing from data/eng/Y145.lstm
>>> Deserialize failed: data/Y145-ground-truth/eng_0.tif read 0/1229531648
>>> lines
>>>
>>> in list.train I put some paths to tif files
>>> how to solve this
>>>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAJbzG8zkqjZuX1ocsWT8VphTnBEHhB_qa4%2BEoDgkGkry0Z-Aig%40mail.gmail.com.
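
A note on the "Deserialize failed: ...eng_0.tif" error quoted above: lstmtraining reads the files named in --train_listfile as serialized .lstmf training data, so a list.train that names .tif images is a likely cause (tesstrain normally writes list.train itself). A minimal Python sketch of a check; `non_lstmf_entries` is a hypothetical helper, not part of tesstrain:

```python
from pathlib import Path


def non_lstmf_entries(list_text: str) -> list[str]:
    """Return the entries of a training list that do not end in .lstmf."""
    entries = [line.strip() for line in list_text.splitlines() if line.strip()]
    return [e for e in entries if Path(e).suffix.lower() != ".lstmf"]


# Example list content; the second path is made up for illustration.
sample = "data/Y145-ground-truth/eng_0.tif\ndata/Y145-ground-truth/eng_1.lstmf\n"
for entry in non_lstmf_entries(sample):
    print("not an .lstmf file:", entry)
```

If the check reports image paths, letting "make training" regenerate the .lstmf files and the list is probably simpler than editing the list by hand.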

Re: [tesseract-ocr] Re: Training from Scratch

2023-11-23 Thread Zdenko Podobny

On Thu, Nov 23, 2023 at 10:28, Des Bw wrote:

> If the original model lacks the ∠ symbol, fine tuning is not going to add
> it for you.


Really???
The Tesseract documentation says:
Fine tuning is the process of training an existing model on new data
without changing any part of the network, although you *can* now add
characters to the character set. (See "Fine Tuning for ± a few characters".)
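
Before choosing between plain fine tuning and "± a few characters" fine tuning, it helps to know which characters the ground truth needs that the base model lacks. A minimal Python sketch, assuming the plain-text unicharset layout (a count line, then one entry per line whose first space-separated field is the character); both function names are hypothetical and the sample entries are toy values:

```python
def unicharset_chars(unicharset_text: str) -> set[str]:
    """Collect the first field of every entry line in a unicharset file."""
    lines = unicharset_text.splitlines()[1:]  # the first line is the entry count
    return {line.split(" ", 1)[0] for line in lines if line}


def missing_characters(ground_truth: list[str], known: set[str]) -> set[str]:
    """Characters used in the ground truth that the unicharset does not list."""
    needed = {ch for line in ground_truth for ch in line if not ch.isspace()}
    return needed - known


# Toy unicharset: the property fields after each character are placeholders.
toy_unicharset = "4\nNULL 0 Common 0\n4 8 0\n5 8 0\n. 10 0\n"
known = unicharset_chars(toy_unicharset)
print(missing_characters(["∠ 45.5"], known))
```

This compares single code points only, so multi-character entries in a real unicharset would need extra handling.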



> We have all gone through that process. To introduce a new character,
> removing the top layer and training from there is the most
> effective approach.
>
> On Thursday, November 23, 2023 at 12:15:56 PM UTC+3 smon...@gmail.com
> wrote:
>
>> If I need to train new characters that are not recognized by a default
>> model, is fine tuning the right approach in this case?
>> One of these characters is the one for angularity:  ∠
>>
>> These symbols appear in technical drawings and should be recognised in
>> those. E.g. for the scenario in the following picture, Tesseract should
>> recognize this symbol.
>>
>>
>>
>> [image: angularity.png]
>>
>> Also here is one of the pngs I tried to train with:
>> [image: angularity_0_r0.jpg]
>> They all look pretty similar to this one. Things that change are the
>> angle, the proportion and the thickness of the lines. All examples have this
>> 64x64 pixel box around them.
>>
>>
>> Is fine tuning the right approach for this scenario? I only find
>> information about fine tuning for specific fonts. For fine tuning, the
>> "tesstrain" repository would also not be needed, as it is used for training
>> from scratch, correct?
>> On Wednesday, November 22, 2023 at 15:27:02 UTC+1, desal...@gmail.com wrote:
>>
>>> From my limited experience, you need a lot more data than that to train
>>> from scratch. If you can't generate more data than that, you might first try to
>>> fine tune, and then train by removing the top layer of the best model.
>>>
>>> On Wednesday, November 22, 2023 at 4:46:53 PM UTC+3 smon...@gmail.com
>>> wrote:
>>>
>>>> As it is not properly possible to combine my traineddata from scratch
>>>> with an existing one, I have decided to also train my traineddata model
>>>> on numbers. Therefore I wrote a script which synthetically generates
>>>> ground truth data with text2image.
>>>> This script uses dozens of different fonts and creates numbers in the
>>>> following formats:
>>>> X.XXX
>>>> X.XX
>>>> X,XX
>>>> X,XXX
>>>> I generated 10,000 files to train the numbers. But unfortunately
>>>> numbers get recognized pretty poorly with the best model (most of the time
>>>> only "0.", "0" or "0," gets recognized).
>>>> So I wanted to ask whether this is not enough training (ground truth) data for
>>>> proper recognition when I train several fonts.
>>>> Thanks in advance for your help.
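
The generating script is not shown in the thread. As a rough illustration, the text side of such ground truth could be drawn as follows, with every digit equally likely in every position so that the model does not mostly see "0." prefixes. The uniform choice and all names here are assumptions; rendering with text2image is left out:

```python
import random

# The four layouts named in the message; X stands for one digit.
FORMATS = ["X.XXX", "X.XX", "X,XX", "X,XXX"]


def random_number_text(fmt: str, rng: random.Random) -> str:
    """Replace every X in the layout with a uniformly drawn digit."""
    return "".join(str(rng.randint(0, 9)) if c == "X" else c for c in fmt)


def make_lines(count: int, seed: int = 0) -> list[str]:
    """Generate `count` ground-truth strings, reproducible for a given seed."""
    rng = random.Random(seed)
    return [random_number_text(rng.choice(FORMATS), rng) for _ in range(count)]


for text in make_lines(5):
    print(text)
```

Each generated string would then go into its own .gt.txt file next to the rendered line image.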


Re: [tesseract-ocr] Troubling with reading text from image

2023-11-19 Thread Zdenko Podobny

Captcha was created to fool OCR, so Tesseract output is as expected ;-)

Zdenko


On Sun, Nov 19, 2023 at 19:15, Исмаилов Ориф wrote:

> Hi, I have images from which I need to read text and numbers, but I am
> having trouble with this:
> [image: Снимок экрана 2023-11-19 230906.png]
> Here is what Tesseract gave:
> [image: Снимок экрана 2023-11-19 230934.png]
> Thank you in advance)
>

Re: [tesseract-ocr] Dictionary?

2023-11-19 Thread Zdenko Podobny

AFAIR there were tests with the legacy engine where the effect of
dictionaries on result quality was measured as a 10-15% improvement for
common text.
However, adding a word to a dictionary has never ensured that Tesseract
recognizes that word accurately.
For non-word inputs (e.g. serial numbers ...) it was always suggested to
turn off dictionaries.
IMO results depend on the input image quality (for good image quality there
seems to be no effect). If you need more details/experiences, dig into the
history of this forum (especially after the first release of version 3).

I never heard that anybody would do such a test for the LSTM engine.

Zdenko


On Sun, Nov 19, 2023 at 18:37, Des Bw wrote:

> Does Tesseract actually use the dictionary (wordlist) included into the
> model (traineddata file)?
>
> - I am not getting any difference/impact by including a dictionary (word
> list) into the file.
>
> Has anybody experimented with a dictionary set up?
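
One way to run the comparison discussed above is to OCR the same image twice, once with the dictionaries loaded and once without, and diff the results. A minimal Python sketch that only builds the two command lines; `load_system_dawg` and `load_freq_dawg` are the Tesseract parameters that control dictionary loading, while the function name and file names are hypothetical:

```python
def tesseract_args(image: str, out_base: str, lang: str = "eng",
                   use_dictionaries: bool = True) -> list[str]:
    """Build a tesseract command line, optionally with dictionaries disabled."""
    args = ["tesseract", image, out_base, "-l", lang]
    if not use_dictionaries:
        # Skip loading the main word list and the frequent-word list.
        args += ["-c", "load_system_dawg=0", "-c", "load_freq_dawg=0"]
    return args


with_dict = tesseract_args("page.png", "out_dict")
without_dict = tesseract_args("page.png", "out_nodict", use_dictionaries=False)
print(" ".join(with_dict))
print(" ".join(without_dict))
```

Each list can be passed to subprocess.run(args, check=True); identical output files would match the "no effect on good images" observation in the reply.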
>

Re: [tesseract-ocr] DLL runtime issues with API on Windows

2023-11-11 Thread Zdenko Podobny

Please provide full information to replicate the problem (exact code, how
you compiled it...)

Zdenko


On Sat, Nov 11, 2023 at 15:20, Anthony Vallone wrote:

> Hello,
>
> I am using MSYS2 to install tesseract on Windows, following the installation
> instructions:
>
> pacman -S mingw-w64-x86_64-tesseract-ocr
> pacman -S mingw-w64-x86_64-tesseract-data-eng
>
> Then, I am attempting to run the example code.
>
> When I run my executable, I get the following pop up error:
>
> the procedure entry point _ZSt17_istream_extractRSiPcx could not be
> located in the dynamic link library C:\msys64\mingw64\bin\libtesseract-5.dll
>
> Any thoughts on what is causing this?
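
The missing symbol _ZSt17_istream_extractRSiPcx demangles to std::__istream_extract(std::istream&, char*, long long), which lives in the GCC C++ runtime (libstdc++-6.dll), not in Tesseract itself. One possible explanation, which nobody in the thread confirmed, is that an older libstdc++-6.dll from another tool is found before the MSYS2 one. A minimal Python sketch that lists every copy on the search path; `find_on_path` is a hypothetical helper:

```python
import os


def find_on_path(filename: str, path_value: str) -> list[str]:
    """Return every existing copy of `filename` in PATH order, without repeats."""
    hits = []
    for directory in path_value.split(os.pathsep):
        if not directory:
            continue
        candidate = os.path.join(directory, filename)
        if os.path.isfile(candidate) and candidate not in hits:
            hits.append(candidate)
    return hits


if __name__ == "__main__":
    # The first hit is the copy a PATH search would pick up first.
    for hit in find_on_path("libstdc++-6.dll", os.environ.get("PATH", "")):
        print(hit)
```

If more than one copy shows up and the first is not under C:\msys64\mingw64\bin, reordering PATH or running the executable from the MSYS2 MinGW shell would be worth a try.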
>

Re: [tesseract-ocr] LSTM-based training produces .box files with the same coordinates

2023-11-01 Thread Zdenko Podobny

Are you following official tutorials?
Did you read the documentation?
Have you tried to check the official training repository and provided
examples?

Zdenko


On Wed, Nov 1, 2023 at 10:15, TRAN TRONG KHANH[학생](대학원 컴퓨터공학과) <
khanhtran...@khu.ac.kr> wrote:

> Hi all,
>
> I tried to run an example of LSTM training and used the following command:
>
> for f in *.tif; do tesseract "$f" "${f%.*}" -l deu lstmbox; done
>
> The resulting box files seem to contain a single line-level box instead of
> character-level boxes. All the characters share the same coordinates, width
> and height. Is this a feature of Tesseract LSTM training? Thanks.
>
> [image: Untitled.png]
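
As summarized in another thread of this archive, per-line box files still have one entry per character, but every entry on a text line repeats the bounding box of the whole line, so repeated coordinates are the expected shape of lstmbox output. A minimal Python sketch that tells the two kinds apart; the function names are hypothetical and the sample coordinates are made up:

```python
def parse_box_line(line: str):
    """Split 'symbol left bottom right top page'; the symbol may be a space or tab."""
    symbol, left, bottom, right, top, page = line.rsplit(" ", 5)
    return symbol, (int(left), int(bottom), int(right), int(top), int(page))


def looks_line_level(box_text: str) -> bool:
    """Heuristic: True when several symbols share the same bounding box."""
    coords = [parse_box_line(l)[1] for l in box_text.splitlines() if l.strip()]
    return len(coords) > 1 and len(set(coords)) < len(coords)


per_line = "D 36 92 580 122 0\ni 36 92 580 122 0\ne 36 92 580 122 0\n\t 581 92 582 122 0"
per_char = "D 36 92 50 122 0\ni 52 92 58 122 0\ne 60 92 72 122 0"
print(looks_line_level(per_line), looks_line_level(per_char))
```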
>

Re: [tesseract-ocr] Getting Error: No such file or directory: 'data/foo/all-lstmf'

2023-10-28 Thread Zdenko Podobny

It does not work on Windows (directly), but it works on Linux => use WSL if
you really need training.
Or wait until somebody finds a fix for Windows (or send the fix - this is an
open source project, so everybody should contribute ;-) )

Zdenko


On Fri, Oct 27, 2023 at 17:32, Dev Solution wrote:

>
> I just tried to run all these commands, but I got an error:
> https://prnt.sc/lLHeR27J2U65
>
> On Tuesday, June 6, 2023 at 10:03:17 AM UTC+2 zdenop wrote:
>
>> Do not create files manually.
>> If "make training" does not work, it means:
>>
>>    1. you are missing some dependency, or the input data are wrong
>>    2. you also missed the error message for 1.
>>
>> I strongly suggest you start the training from the beginning
>> (including cloning tesstrain) and pay attention to all messages:
>>
>> git clone --depth 1 https://github.com/tesseract-ocr/tesstrain.git
>> cd tesstrain
>> make tesseract-langdata
>> mkdir tessdata_best
>> wget https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata -P tessdata_best
>> unzip ocrd-testset.zip -d data/ocrd-ground-truth
>> make training MODEL_NAME=ocrd TESSDATA=tessdata_best MAX_ITERATIONS=1
>>
>>
>> Zdenko
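
The make output quoted further down in this thread pairs each image with a transcription of the same base name (name.tif and name.gt.txt) inside data/MODEL_NAME-ground-truth. A "Failed to read data from: .../all-gt" message would be consistent with no usable .gt.txt files being found there, although that is an assumption. A minimal Python sketch of a pairing check; the helper name and the set of image suffixes are assumptions:

```python
from pathlib import Path

IMAGE_SUFFIXES = (".tif", ".png")  # assumed; adjust to the formats you use


def unpaired_ground_truth(gt_dir: str) -> dict[str, list[str]]:
    """Report images without text, text without images, and empty text files."""
    root = Path(gt_dir)
    files = list(root.iterdir())
    images = {p.stem for p in files if p.suffix in IMAGE_SUFFIXES}
    texts = {p.name[: -len(".gt.txt")] for p in files if p.name.endswith(".gt.txt")}
    empty = sorted(
        base for base in texts
        if not (root / f"{base}.gt.txt").read_text(encoding="utf-8").strip()
    )
    return {
        "image_without_text": sorted(images - texts),
        "text_without_image": sorted(texts - images),
        "empty_text": empty,
    }
```

Calling unpaired_ground_truth("data/foo-ground-truth") before "make training" should return three empty lists for a clean data set.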
>>
>>
>> On Mon, Jun 5, 2023 at 4:22, Madhav Pandey wrote:
>>
>>> Hi Zdenop,
>>>
>>> Apologies. I got your name wrong in the thread.
>>>
>>> Can you please help me resolve this issue? Because the make training
>>> command was not creating the all-gt file, I created it manually and put it
>>> in the MODEL_NAME directory.
>>>
>>> I created it by copying all the single lines from the text
>>> files into the all-gt file. I am not sure if this is the right
>>> approach. Please correct me if I am wrong here.
>>>
>>> Now, after doing this, I am getting this error:
>>>
>>> python3 shuffle.py 0 "data/Apex/all-lstmf"
>>> Traceback (most recent call last):
>>>   File "/Users/madpande/Code/git/tesseract_tutorial/tesstrain/shuffle.py", line 24, in <module>
>>>     fd0 = open(sys.argv[2], 'r')
>>> FileNotFoundError: [Errno 2] No such file or directory: 'data/Apex/all-lstmf'
>>>
>>>
>>> I am pretty sure I am missing something here. Please help!
>>>
>>> Thanks!
>>>
>>> On Thursday, 1 June 2023 at 23:39:01 UTC-6 Madhav Pandey wrote:
>>>
>>>> Hi Zdenko,
>>>>
>>>> At what step in the make file is the all-gt file created? I am still
>>>> unable to move forward with the custom model training.
>>>>
>>>> Any help would be greatly appreciated. Thanks!
>>>>
>>>> On Wednesday, 26 April 2023 at 09:47:55 UTC-6 zdenop wrote:

>>>>> make training TESSDATA=./usr/local/share/tessdata
>>>>> unicharset_extractor --output_unicharset "data/foo/unicharset"
>>>>> --norm_mode 2 "data/foo/all-gt"
>>>>>
>>>>> Failed to read data from: data/foo/all-gt
>>>>>
>>>>>
>>>>> This indicates that you already ran a training that failed...
>>>>> Clean your training and start it once again. Pay attention to why
>>>>> "data/foo/all-gt" is not created (there will be an error message).
>>>>>
>>>>> Zdenko
>>>>>
>>>>>
>>>>> On Wed, Apr 26, 2023 at 2:07, Madhav Pandey wrote:
>>>>>
>>>>>> @zdenop
>>>>>>
>>>>>> This is the entire training output:
>>>>>>
>>>>>> ```make training TESSDATA=./usr/local/share/tessdata
>>>>>> unicharset_extractor --output_unicharset "data/foo/unicharset"
>>>>>> --norm_mode 2 "data/foo/all-gt"
>>>>>> Failed to read data from: data/foo/all-gt
>>>>>> Wrote unicharset file data/foo/unicharset
>>>>>> PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i
>>>>>> "data/foo-ground-truth/alexis_ruhe01_1852_0087_027.tif" -t
>>>>>> "data/foo-ground-truth/alexis_ruhe01_1852_0087_027.gt.txt" >
>>>>>> "data/foo-ground-truth/alexis_ruhe01_1852_0087_027.box"
>>>>>> set -x; \
>>>>>> tesseract
>>>>>> "data/foo-ground-truth/alexis_ruhe01_1852_0087_027.tif"
>>>>>> data/foo-ground-truth/alexis_ruhe01_1852_0087_027 --psm 13 lstm.train
>>>>>> + tesseract data/foo-ground-truth/alexis_ruhe01_1852_0087_027.tif
>>>>>> data/foo-ground-truth/alexis_ruhe01_1852_0087_027 --psm 13 lstm.train
>>>>>> PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i
>>>>>> "data/foo-ground-truth/alexis_ruhe01_1852_0018_022.tif" -t
>>>>>> "data/foo-ground-truth/alexis_ruhe01_1852_0018_022.gt.txt" >
>>>>>> "data/foo-ground-truth/alexis_ruhe01_1852_0018_022.box"
>>>>>> set -x; \
>>>>>> tesseract
>>>>>> "data/foo-ground-truth/alexis_ruhe01_1852_0018_022.tif"
>>>>>> data/foo-ground-truth/alexis_ruhe01_1852_0018_022 --psm 13 lstm.train
>>>>>> + tesseract data/foo-ground-truth/alexis_ruhe01_1852_0018_022.tif
>>>>>> data/foo-ground-truth/alexis_ruhe01_1852_0018_022 --psm 13 lstm.train
>>>>>> PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i
>>>>>> "data/foo-ground-truth/alexis_ruhe01_1852_0035_019.tif" -t
>>>>>> "data/foo-ground-truth/alexis_ruhe01_1852_0035_019.gt.txt" >
>>>>>> "data/foo-ground-truth/alexis_ruhe01_1852_0035_019.box"
>>>>>> set -x; \
>>>>>> tesseract
>>>>>> "data/foo-ground-truth/alexis_ruhe01_1852_0035_019.tif"
>>>>>> data/foo-ground-truth/alexis_ruhe01_1852_0035_019 --psm 13 lstm.train
>>>>>>>

Re: [tesseract-ocr] OCR Output contains "xlz"

2023-10-15 Thread Zdenko Podobny

Seems like you should put this question to the author of the language data
"ARYuanB5-MD"...

Zdenko


On Sun, Oct 15, 2023 at 15:44, 'Danny Wilson' via tesseract-ocr <
tesseract-ocr@googlegroups.com> wrote:

> Running tesseract on a single Chinese character "對" outputs the character,
> but also the text "xlz".
>
> Command line:
> tesseract sub0089w.png debugOut -l ARYuanB5-MD --dpi 72 --psm 6 -c
> preserve_interword_spaces=1
>
> The output is two lines:
> xlz
> 對
>
> It used to output "sMz"  but after retraining several times with the
> specific font in use, it now outputs "xlz".
>
> Why?
>
> I've attached the image file in question...
>
> [image: sub0089w.png]
>
> (Searching the source code, the file universalambigs.h has a line " xlZ le
> 1" which is similar, but not exact to the errant text I'm finding)
>
> Thank you.
>

Re: [tesseract-ocr] "Leptonica was build without TIFF support! Disabling TIFF support..."

2023-10-15 Thread Zdenko Podobny

Honestly, this is a very messy configuration for me. Why? Tesseract (and
other projects) use CMake to avoid such manual settings.

Just follow the example in our GitHub action for CMake and Windows [1] - it is
stupidly simple and it works. CMake takes care of correct linking
(debug/release) and of the build (no need to run nmake).


[1]
https://github.com/tesseract-ocr/tesseract/blob/main/.github/workflows/cmake-win64.yml


Zdenko


On Sat, Oct 14, 2023 at 16:13, DJuego Director De Juego wrote:

> a) I have attached the construction logs of leptonica and tesseract.  Yes,
> I can guarantee that I only have one version of leptonica installed.
>
> For leptonica and tesseract I use a script for the build.
>
> And they seem to work fine because it generates all the files I need and
> gives no errors. Here I present the excerpts from the script that I think
> are the decisive ones.
>
> For debug
> cmake -G 'NMake Makefiles' -D CMAKE_BUILD_TYPE=Debug -D SW_BUILD:BOOL=OFF
> -D BUILD_PROG:BOOL=ON -D ENABLE_TIFF:BOOL=ON -D ENABLE_ZLIB:BOOL=ON -D
> ENABLE_PNG:BOOL=ON -D ENABLE_JPEG:BOOL=ON -D ENABLE_GIF:BOOL=ON -D
> CMAKE_INSTALL_PREFIX=$DIRECTORIO_INSTALACION_LIBRERIA/DEBUG -D
> ZLIB_LIBRARY_DEBUG=$DIRECTORIO_INSTALACION/zlib/DEBUG/lib/zlibd.lib -D
> ZLIB_LIBRARY_RELEASE=$DIRECTORIO_INSTALACION/zlib/RELEASE/lib/zlib.lib  -D
> ZLIB_INCLUDE_DIR=$DIRECTORIO_INSTALACION/zlib/DEBUG/include -D
> TIFF_LIBRARY_DEBUG=$DIRECTORIO_INSTALACION/tiff/DEBUG/lib/tiffd.lib -D
> TIFF_LIBRARY_RELEASE=$DIRECTORIO_INSTALACION/tiff/RELEASE/lib/tiff.lib  -D
> TIFF_INCLUDE_DIR=$DIRECTORIO_INSTALACION/tiff/DEBUG/include -D
> PNG_LIBRARY=$DIRECTORIO_INSTALACION/libpng/DEBUG/lib/libpng16d.lib -D
> PNG_PNG_INCLUDE_DIR=$DIRECTORIO_INSTALACION/libpng/DEBUG/include -D
> JPEG_LIBRARY_DEBUG=$DIRECTORIO_INSTALACION/jpeg/DEBUG/lib/jpeg.lib -D
> JPEG_LIBRARY_RELEASE=$DIRECTORIO_INSTALACION/jpeg/RELEASE/lib/jpeg.lib -D
> JPEG_INCLUDE_DIR=$DIRECTORIO_INSTALACION/jpeg/DEBUG/include -D
> GIF_LIBRARY=$DIRECTORIO_INSTALACION/giflib/DEBUG/lib/gif.lib -D
> GIF_INCLUDE_DIR=$DIRECTORIO_INSTALACION/giflib/DEBUG/include ../..
>  nmake
>  nmake install
>
> For release
> cmake -G 'NMake Makefiles' -D CMAKE_BUILD_TYPE=Release -D
> SW_BUILD:BOOL=OFF -D BUILD_PROG:BOOL=ON -D ENABLE_TIFF:BOOL=ON -D
> ENABLE_ZLIB:BOOL=ON -D ENABLE_PNG:BOOL=ON -D ENABLE_JPEG:BOOL=ON -D
> ENABLE_GIF:BOOL=ON -D
> CMAKE_INSTALL_PREFIX=$DIRECTORIO_INSTALACION_LIBRERIA/RELEASE -D
> ZLIB_LIBRARY_DEBUG=$DIRECTORIO_INSTALACION/zlib/DEBUG/lib/zlibd.lib -D
> ZLIB_LIBRARY_RELEASE=$DIRECTORIO_INSTALACION/zlib/RELEASE/lib/zlib.lib -D
> ZLIB_INCLUDE_DIR=$DIRECTORIO_INSTALACION/zlib/RELEASE/include -D
> TIFF_LIBRARY_DEBUG=$DIRECTORIO_INSTALACION/tiff/DEBUG/lib/tiffd.lib -D
> TIFF_LIBRARY_RELEASE=$DIRECTORIO_INSTALACION/tiff/RELEASE/lib/tiff.lib -D
> TIFF_INCLUDE_DIR=$DIRECTORIO_INSTALACION/tiff/RELEASE/include -D
> PNG_LIBRARY=$DIRECTORIO_INSTALACION/libpng/RELEASE/lib/libpng16.lib -D
> PNG_PNG_INCLUDE_DIR=$DIRECTORIO_INSTALACION/libpng/RELEASE/include -D
> JPEG_LIBRARY_DEBUG=$DIRECTORIO_INSTALACION/jpeg/DEBUG/lib/jpeg.lib -D
> JPEG_LIBRARY_RELEASE=$DIRECTORIO_INSTALACION/jpeg/RELEASE/lib/jpeg.lib -D
> JPEG_INCLUDE_DIR=$DIRECTORIO_INSTALACION/jpeg/RELEASE/include -D
> GIF_LIBRARY=$DIRECTORIO_INSTALACION/giflib/RELEASE/lib/gif.lib -D
> GIF_INCLUDE_DIR=$DIRECTORIO_INSTALACION/giflib/RELEASE/include ../..
>  nmake
>  nmake install
>
>
> For debug
> cmake -G 'NMake Makefiles' -D CMAKE_BUILD_TYPE=Debug -D SW_BUILD:BOOL=OFF
> -D BUILD_TRAINING_TOOLS:BOOL=OFF -D
> Leptonica_DIR=$DIRECTORIO_INSTALACION/leptonica/DEBUG/lib/cmake/leptonica
> -D TIFF_LIBRARY=$DIRECTORIO_INSTALACION/tiff/DEBUG/lib/tiffd.lib -D
> TIFF_INCLUDE_DIR=$DIRECTORIO_INSTALACION/tiff/DEBUG/include -D
> CMAKE_INSTALL_PREFIX=$DIRECTORIO_INSTALACION_LIBRERIA/DEBUG ../..
> nmake
> nmake install
>
> For release
> cmake -G 'NMake Makefiles' -D CMAKE_BUILD_TYPE=Release -D
> SW_BUILD:BOOL=OFF -D BUILD_TRAINING_TOOLS:BOOL=OFF -D
> Leptonica_DIR=$DIRECTORIO_INSTALACION/leptonica/RELEASE/lib/cmake/leptonica
> -D TIFF_LIBRARY=$DIRECTORIO_INSTALACION/tiff/RELEASE/lib/tiff.lib -D
> TIFF_INCLUDE_DIR=$DIRECTORIO_INSTALACION/tiff/RELEASE/include -D
> CMAKE_INSTALL_PREFIX=$DIRECTORIO_INSTALACION_LIBRERIA/RELEASE ../..
> nmake
> nmake install
>
> Everything seems to be working perfectly. The only issue has to do with
> the message:
>
> -- Found leptonica version: 1.84.0
> Leptonica was build without TIFF support! Disabling TIFF support...
> -- TIFF support disabled.
>
> I think the message is incorrect because the tesseract.exe executable that
> is generated *needs *zlib.dll, tiff.dll and libpng16.dll (in the same
> folder) to run. :-)
>
> DJuego
>
>
>
> On Monday, October 9, 2023 at 7:26:58 UTC+1, zdenop wrote:
>
>> Please provide full logs including installation, configure
>> parameters etc. - not screenshots.
>> Make sure you have only one installation of the leptonica library.
>> Make your own test of whether leptonica is built with TIFF.
>> Use release target

Re: [tesseract-ocr] Deserialize Header Failed

2023-10-14 Thread Zdenko Podobny

Hello,

tesseract works out of the box.

What does not work is you users downloading Tesseract at night and
jumping straight into Tesseract training. Training requires knowledge and
experience that you will not get by following some random internet
tutorials (most of them are outdated and only pretend to be successful in
order to monetize their video, blog etc...)

The better approach is to read the official Tesseract documentation, read
this forum, and understand Tesseract's limitations (yes, like every piece of
software on this earth, it has limitations).
Then you can make an informed decision about whether training makes sense or
not. Or ask more experienced users for advice (if you are willing to
provide details of what you are trying to achieve, e.g. input images).

Otherwise, you are alone with your problems. And it is not because of the
tesseract.

Zdenko


On Sat, Oct 14, 2023 at 12:23, Memeroni wrote:

> Hey folks, I downloaded tesseract tonight and I'm having an issue I can't
> get past. The error output is as follows: Deserialize header failed: ☺
> First document cannot be empty!!
> num_pages_per_doc_ > 0:Error:Assert failed:in file
> ../../../src/ccstruct/imagedata.cpp, line 704
>
> I am using a tif file as my raw image source. I have tried 2 different
> methods of generating the tif file. The first method is taking a screenshot
> with snipping tool, pasting it into gimp and saving as a tif. I also tried
> print screening instead of snipping tool. The second method is taking a
> screenshot with snipping tool, saving as a .png, then converting to .tif
> via ImageMagick commandline. I am creating the box file like so:
>
> tesseract 9.tif 9 makebox
>
> I then edit the box file to make sure it is an accurate representation
> of the characters on the screen. I have also tried creating the box file
> and just leaving it as is to see if that resolves the issue; it does not. I then
> proceed to create the lstmf file like so:
>
> tesseract 9.tif 9 --psm 6 lstm.train
>
> I then try to run lstmtraining or lstmeval and i get the header error
> every time. I am using version 5.3.3, but I have also tried using v4.1,
> recreating all the files and I still got the same issue. Does anyone know
> why I'm getting this issue, and how to resolve it? About to give up with
> tesseract because this shit does not work out of the box. I am following
> google instructions to a T so I either overlooked something crucial that is
> ruining my lstmf file or this shit just does not work for me. Appreciate
> any help that can be provided.
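
One thing worth checking in the workflow quoted above: box files for LSTM training mark the end of every text line with an extra entry whose symbol is a tab character, and box files written by makebox do not contain such entries. Whether this is what leaves the .lstmf file empty here is an assumption. A minimal Python sketch of the check; the function name is hypothetical and the sample coordinates are made up:

```python
def has_end_of_line_markers(box_text: str) -> bool:
    """True when at least one box entry uses a tab as its symbol."""
    return any(line.startswith("\t ") for line in box_text.splitlines())


makebox_style = "9 10 12 30 44 0\n5 34 12 54 44 0"
lstm_style = "9 10 12 54 44 0\n5 10 12 54 44 0\n\t 55 12 56 44 0"
print(has_end_of_line_markers(makebox_style), has_end_of_line_markers(lstm_style))
```

If the markers are missing, generating the box file with lstmbox or letting tesstrain create it from a .gt.txt file avoids editing coordinates by hand.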
>

Re: [tesseract-ocr] "Leptonica was build without TIFF support! Disabling TIFF support..."

2023-10-08 Thread Zdenko Podobny

Please provide full logs including installation, configure parameters etc.
- not screenshots.
Make sure you have only one installation of the leptonica library.
Make your own test of whether leptonica is built with TIFF.
Use the release target and not debug.

Zdenko


On Sun, Oct 8, 2023 at 21:56, DJuego Director De Juego wrote:

> Hi, Tesseract Community. My first message.
>
> Today I built the leptonica project with TIFF support (see figure).
> The build seems to have been a success.
>
> But when I build Tesseract and link it to Leptonica, it doesn't seem to
> detect TIFF support. In fact, it claims that leptonica has no support and
> states that it disables its own support (see picture).
>
> What could be going on? Any suggestion?
>
> DJuego
>

Re: [tesseract-ocr] Multiple colours text in an image

2023-10-07 Thread Zdenko Podobny

Hello,

this is about image preprocessing/thresholding rather than tesseract...
Please post an example image so tesseract users can test it and suggest
a possible solution.

Zdenko


On Thu, Sep 21, 2023 at 13:04, Iago Giné wrote:

> Hi all,
>
> Is there some option to tell tesseract-ocr that there is text in
> multiple colours, so that it detects all the text? For example, in my case, I
> have a pdf with the cover of a book, with a yellow background and text both
> in black and in white. Depending on how I proceed, I get only the text
> in black or the text in white, but not both.
>
> I have only found the following issue, but no answer or anything more:
> https://github.com/tesseract-ocr/tesseract/issues/3078
>
> Thank you for your time!
>
> Iago
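
As the reply says, this is a preprocessing question. One idea, which is not from the thread, is to binarize by distance from the background colour instead of by brightness, so that black text and white text on yellow both end up as dark text on white. A minimal pure-Python sketch, assuming the background is the single most common colour; the function name and threshold are hypothetical, and loading or saving the image is left out:

```python
from collections import Counter


def binarize_by_background_distance(pixels, threshold=100.0):
    """pixels: rows of (r, g, b) tuples. Returns rows of 0 (text) or 255 (background)."""
    # Assumption: the most frequent colour in the image is the background.
    background = Counter(p for row in pixels for p in row).most_common(1)[0][0]

    def far_from_background(p):
        # Squared Euclidean distance in RGB space against the squared threshold.
        return sum((a - b) ** 2 for a, b in zip(p, background)) > threshold ** 2

    return [[0 if far_from_background(p) else 255 for p in row] for row in pixels]


Y, K, W = (255, 255, 0), (0, 0, 0), (255, 255, 255)
cover = [[Y, Y, Y, Y], [Y, K, W, Y], [Y, Y, Y, Y]]
for row in binarize_by_background_distance(cover):
    print(row)
```

A real scan has noise and gradients, so the background estimate and the threshold would need tuning on the actual cover image.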
>

Re: [tesseract-ocr] quality of recognition of customer invoices

2023-09-22 Thread Zdenko Podobny

I know there are (were) people at the forum that implemented Tesseract as
part of invoice processing - but as a commercial solution.

It is not as easy as it looks: there is a need for a custom solution for
text detection (e.g. skipping logos and other graphics, possible
handwriting). As far as I remember they created a new engine for amount
recognition - this is the most critical part of invoice processing.

A few years ago I had a discussion with a professional provider of such
services in Europe (they did not use Tesseract) and they informed me they
try to avoid data extraction from invoices and they insist on invoice data
exchange because it is cheaper and more reliable...

Just my 2 cents - what you can expect or what problems you will need to
solve.

Zdenko
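
For the O-versus-0 confusion in the messages quoted below, a post-processing step on fields with a known layout is a common workaround when whitelisting does not help. This is not a Tesseract feature and was not proposed in the thread. A minimal Python sketch, assuming the field starts with a two-letter prefix and that the letter O never legitimately appears after it; the function name is hypothetical:

```python
def fix_letter_o_in_digits(token: str, prefix_len: int = 2) -> str:
    """Keep the letter prefix and turn any letter O after it into the digit 0."""
    head, tail = token[:prefix_len], token[prefix_len:]
    return head + tail.replace("O", "0").replace("o", "0")


print(fix_letter_o_in_digits("NLOO790B01"))
```

The rule only makes sense per field type, so each invoice field would need its own layout assumption.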


On Fri, Sep 22, 2023 at 14:24, A Nederpelt wrote:

> Well, I have approximately 3000 customers at the moment for our software.
> We run OCR on lots of invoices, e.g. one customer processes approx. 10,000
> documents a month.
> So open source is worth it. I want Tesseract, since it is free to use.
> I believe open source is the future.
>
> So, can somebody help me optimize it?
>
> By lots of CPU usage I mean when it needs to use more CPU for some
> parameter like "super quality". I want to use that parameter.
>
> On Friday, September 22, 2023 at 14:03:53 UTC+2, desal...@gmail.com wrote:
>
>> The CPU usage is unusual. I have pretty old mac (from 2011); have been
>> running Tesseract quite fine.
>> But, as to the accuracy, if your project is limited in scale, the
>> commercial tools would definitely perform better for you. But, if you have
>> long lasting, and extensive projects, Tesseract is worth spending your time
>> and developing (training) it.
>>
>>
>> On Friday, September 22, 2023 at 2:50:50 PM UTC+3 powe...@gmail.com
>> wrote:
>>
>>> Well, the problem is why it chooses:
>>> NLOO790B01
>>> [image: Lambregts0001 - cleaned - btwnr.jpg]
>>> 2 times the character O and 5 times a 0 (ZERO)
>>>
>>> Google vision result: "NL00790B01"
>>>
>>> Nuance / OMNIPage: "NL00790B01"
>>>
>>> Leadtools demo: "NL00790B01"
>>>
>>> I want to use Tesseract, but I guess I need things like "second pass"
>>> or "preprocessing", no dictionary etc.
>>> So I would rather have a CPU usage of 99.99% than super speed.
>>>
>>> Can somebody help me?
>>>
>>> On Friday, September 22, 2023 at 13:25:21 UTC+2, desal...@gmail.com wrote:
>>>
 Apparently, version 4 doesn't support white listing.
 https://groups.google.com/g/tesseract-ocr/c/IBbQIQpdSpE
 That is not good.
 On Friday, September 22, 2023 at 2:23:39 PM UTC+3 Des Bw wrote:

> The difference between zero and O is deeply problematic, for the human
> eye. Some fonts make it even harder.
> You can try the method used here:
> https://pyimagesearch.com/2021/09/06/whitelisting-and-blacklisting-characters-with-tesseract-and-python/
> if that helps.
> On Friday, September 22, 2023 at 9:43:51 AM UTC+3 powe...@gmail.com
> wrote:
>
>> I found the parameters
>> "C:\Program Files\Tesseract-OCR\tesseract.exe" "..\Lambregts0001 -
>> cleaned.jpg" "Lambregts0001 - cleaned.txt" -c
>> tessedit_char_whitelist="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789
>> :@."
>> It is not working. "uw BTW nummer:: NLOO790B01"
>>
>> Any other ideas ?
>>
>> On Thursday, September 21, 2023 at 22:25:12 UTC+2,
>> elvi...@gmail.com wrote:
>>
>>> White list the digits so that the O will not confuse it.
>>>
>> You can also try --psm 13 if all of your texts are single line.
>>>
>>
>>> On Thu, Sep 21, 2023, 4:07 PM A Nederpelt  wrote:
>>>
 Hi.
 I am trying to use the Tesseract engine instead of the Nuance
 engine.
 When I run tesseract.exe on the image, it returns a few
 strange characters:
 twice OO instead of 00
   "uw BTW nummer:: NLOO790B01"
 instead of
   "uw BTW nummer:: NL00790B01"
 and
 "Tel £01"
 instead of
 "Tel : 01"
 but "Tel : 0168-452452" is recognized ok.

 I see nothing to optimize using
 https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md
 because these are really clean documents.

 Am I missing some parameters? Like a second run, or a more accurate
 run, etc.
 Maybe compile tesseract.exe myself with different, higher-quality
 parameters?

 Thanks,
 Alwin

 --
 You received this message because you are subscribed to the Google
 Groups "tesseract-ocr" group.
 To unsubscribe from this group and stop receiving emails from it,
 send an email to tesseract-oc...@googlegroups.com.
 To view this discussion on the web visit
 https://groups.google.com/d/msgid/tesseract-ocr/6f5f957e-4f33-419f-aba6-2e8a3f6f8
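
When the layout of a field is known in advance, the O/0 confusion discussed in this thread can also be repaired after OCR. The following is only a minimal C++ sketch of that idea, not something Tesseract provides. The function name fixByMask and the mask notation ('9' = digit expected, 'A' = letter expected, anything else = leave alone) are assumptions made for illustration; the mask in the usage note is simply read off the expected result "NL00790B01" quoted above.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of Tesseract): repair common letter/digit
// confusions at positions where the expected character class is known.
// mask: '9' = digit expected, 'A' = letter expected, other = keep as is.
std::string fixByMask(const std::string& text, const std::string& mask) {
    // If the lengths differ, the mask does not apply; return the input.
    if (text.size() != mask.size()) return text;
    std::string out = text;
    for (std::size_t i = 0; i < out.size(); ++i) {
        char& c = out[i];
        if (mask[i] == '9') {
            // A digit is expected here: map look-alike letters to digits.
            if (c == 'O' || c == 'o') c = '0';
            else if (c == 'I' || c == 'l') c = '1';
        } else if (mask[i] == 'A') {
            // A letter is expected here: map look-alike digits to letters.
            if (c == '0') c = 'O';
            else if (c == '1') c = 'I';
        }
    }
    return out;
}
```

For example, fixByMask("NLOO790B01", "AA99999A99") turns the two letters O into zeros and leaves everything else untouched. This only helps for fields with a fixed layout; free text still needs better recognition.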

Re: [tesseract-ocr] how to manual install tesseract-ocr all code include third library code build without cmake in windows

2023-09-21 Thread Zdenko Podobny

Why do you want to compile tesseract?

Zdenko


On Thu, Sep 21, 2023 at 15:26, Phoenix Tree wrote:

> I am a noob.
>
> My Windows machine has some limits:
> I have no network access, so I must manually download all the tesseract-ocr
> code, including the third-party library code;
> I can't use cmake,
> but I can write Python scripts,
> and I only have gcc version 11.2.
>
> How do I compile in this situation?
> Please help me.
>

Re: [tesseract-ocr] Tesseract Custom Model Not Recognized after Training

2023-09-18 Thread Zdenko Podobny

Unfortunately, you left out all the important information (e.g. how did you
run training? how did you run tesseract, including tesseract options, exact
command or code?), so here are just some hints:

> Error: LSTM requested, but not present!!

This implies that the requested traineddata file does not contain needed
LSTM components.

> Loading tesseract. Error: Tesseract (legacy) engine requested, but
> components are not present in /usr/share/tesseract-ocr/4.00/
> tessdata/ocrtensor.traineddata!!

This implies that the requested traineddata file does not contain needed
legacy components.

I never saw these 2 messages together. Typically people either follow some
old outdated tutorial and train tesseract legacy components or train for
LSTM engine (without legacy components), but ask tesseract to use legacy
engine...
Based on this I guess your ocrtensor.traineddata is not a valid tesseract
file.

Zdenko


On Sun, Sep 17, 2023 at 17:41, demian kim wrote:

> Body:
>
> Hello Tesseract Community,
>
> I am facing a challenge with my custom-trained Tesseract model, and I'm
> hoping for some guidance on resolving this issue.
>
> Background:
>
> 1. I've successfully trained a custom model (ocrtensor.traineddata).
> 2. The training finished without any error and I've copied the
>    generated .traineddata file to /usr/share/tesseract-ocr/4.00/tessdata/.
> 3. I'm trying to use this model in a Jupyter Notebook container with
>    the pytesseract Python package.
>
> Problem:
>
> Even though the model was working fine previously, I am now encountering
> an error when trying to use the model. The error suggests that Tesseract
> can't initialize with the custom model:
> TesseractError: (1, "Error: LSTM requested, but not present!! Loading
> tesseract. Error: Tesseract (legacy) engine requested, but components are
> not present in
> /usr/share/tesseract-ocr/4.00/tessdata/ocrtensor.traineddata!! Failed
> loading language 'ocrtensor' Tesseract couldn't load any languages! Could
> not initialize tesseract.")
>
> Steps Tried:
>
> 1. Ensured the Tesseract version compatibility (using version 4).
> 2. Checked file permissions (even tried with chmod 777).
> 3. Restarted Jupyter Notebook container multiple times.
> 4. Tried executing Tesseract from the terminal directly.
> 5. Made sure the TESSDATA_PREFIX environment variable is set correctly.
> 6. Tried Tesseract with logging enabled for additional error details.
>
> I'm unsure why the model suddenly isn't recognized when it was working
> just a while ago. If anyone has insights or suggestions on what might be
> going wrong, I would greatly appreciate it.
>
> Thank you for your assistance.
>

Re: [tesseract-ocr] Strange behaviour of Tesseract

2023-09-14 Thread Zdenko Podobny

>
> Is it still broken in version 5? The thread you posted is from 2017!

[image: image.png]


Zdenko


On Thu, Sep 14, 2023 at 17:10, Gilad Pellaeon wrote:

> Is it still broken in version 5? The thread you posted is from 2017!
>
> One thing I noticed in the meantime: I saved my PNGs with paint.net using
> automatic bit depth detection. The image is grayscale, so it wasn't
> stored as 32-bit.
> Now I forced the image to be saved with 32-bit color depth, and it works. So I
> assume it's a bug in the bit depth handling. This would also explain the
> memory access violation even though the rectangle doesn't exceed the image
> size: if the system internally assumes a larger depth, it also
> tries to process larger memory chunks.
>
> Is this problem known? Otherwise I can create a bug report.
>
>
> Best regards
>
> On Thursday, September 14, 2023 at 16:52:58 UTC+2, zdenop wrote:
>
>>
>> https://github.com/tesseract-ocr/tesseract/issues/845
>>
>> Zdenko
>>
>>
>> On Thu, Sep 14, 2023 at 16:49, Gilad Pellaeon wrote:
>>
>>> Hi,
>>>
>>> I am new to Tesseract. I searched for an OCR library, found Tesseract
>>> and now I want to use it for a specific measure protocol.
>>>
>>> I built Tesseract 5.3.2 from source, along with the dependencies
>>> leptonica-1.83, libpng and OpenJPEG, using the latest Visual C++ compiler
>>> for Windows, x64.
>>>
>>> Then I did some first tests based on the examples from the documentation
>>> (*Basic_example* and *SetRectangle_example*). As data set I use
>>> *eng.traineddata* from the *tessdata_best* repo.
>>>
>>> Now I see behaviour which I can't classify. I tried to recognize a
>>> float value in a given rectangle (with *SetRectangle*). Tesseract
>>> didn't convert it (empty return). Then I manually copied the rectangle
>>> and saved it in a new file (see attached Single_Number.png). Then I tried
>>> this file without the *SetRectangle* call. Now it works.
>>>
>>> The attached *Protocol_table.png* is the original image, but I removed
>>> all other stuff in the picture. So it's empty except for the number at the
>>> original position. Now I have the following behaviour: in DEBUG mode the
>>> conversion works, in RELEASE mode it does not.
>>>
>>> I also tried to slightly enlarge the rectangle area (see the last
>>> SetRectangle call in the code below). But then I got a runtime exception.
>>> The resolution of the picture is 2625x1682. So there should be no buffer
>>> overflow?!
>>>
>>> Am I doing something wrong here? Or what causes this
>>> behaviour?
>>>
>>> This is my basic code:
>>>
>>> // std includes
>>> #include <cstdio>
>>> #include <iostream>
>>>
>>> // tesseract includes
>>> #include "tesseract/baseapi.h"
>>>
>>> // Leptonica includes
>>> #include "allheaders.h"
>>>
>>> int main()
>>> {
>>>     tesseract::TessBaseAPI api;
>>>     // Initialize tesseract-ocr with English, without specifying the
>>>     // tessdata path
>>>     if (api.Init(nullptr, "eng"))
>>>     {
>>>         std::cout << "Could not initialize tesseract." << std::endl;
>>>         return 1;
>>>     }
>>>
>>>     Pix* image =
>>>         pixRead("D:/projects/cpp/Tesseract-Test/Protocol_Table.png");
>>>     //Pix* image =
>>>     //    pixRead("D:/projects/cpp/Tesseract-Test/Single_Number.png");
>>>     api.SetImage(image);
>>>     // Restrict recognition to a sub-rectangle of the image
>>>     // SetRectangle(left, top, width, height)
>>>     api.SetRectangle(807, 1393, 93, 49);
>>>     //api.SetRectangle(707, 1293, 193, 149);
>>>     // Get OCR result
>>>     char* outText = api.GetUTF8Text();
>>>     if (outText)
>>>         printf("OCR output:\n%s", outText);
>>>
>>>     // Destroy used objects and release memory
>>>     delete[] outText;
>>>     api.End();
>>>     pixDestroy(&image);
>>> }
>>>

Re: [tesseract-ocr] Strange behaviour of Tesseract

2023-09-14 Thread Zdenko Podobny

https://github.com/tesseract-ocr/tesseract/issues/845
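
As a side note on the "no buffer overflow?!" question in the quoted message below: both rectangles in the posted code do fit inside the stated 2625x1682 image, so a plain out-of-range request does not look like the cause. A minimal C++ sketch of that arithmetic follows; rectInsideImage is a hypothetical helper written for this note, not a Tesseract or Leptonica function.

```cpp
// Hypothetical helper: true if the rectangle (left, top, width, height)
// lies entirely inside an image of imgWidth x imgHeight pixels.
bool rectInsideImage(int left, int top, int width, int height,
                     int imgWidth, int imgHeight) {
    if (left < 0 || top < 0 || width <= 0 || height <= 0) return false;
    // Subtract instead of adding, so large values cannot overflow an int.
    return width <= imgWidth - left && height <= imgHeight - top;
}
```

Calling it with (807, 1393, 93, 49) or (707, 1293, 193, 149) and an image size of 2625x1682 returns true in both cases. A check like this before SetRectangle is cheap, but it would not catch a problem that originates in how the image data itself is interpreted.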

Zdenko


On Thu, Sep 14, 2023 at 16:49, Gilad Pellaeon wrote:

> Hi,
>
> I am new to Tesseract. I searched for an OCR library, found Tesseract and
> now I want to use it for a specific measure protocol.
>
> I built Tesseract 5.3.2 from source, along with the dependencies
> leptonica-1.83, libpng and OpenJPEG, using the latest Visual C++ compiler
> for Windows, x64.
>
> Then I did some first tests based on the examples from the documentation
> (*Basic_example* and *SetRectangle_example*). As data set I use
> *eng.traineddata* from the *tessdata_best* repo.
>
> Now I see behaviour which I can't classify. I tried to recognize a
> float value in a given rectangle (with *SetRectangle*). Tesseract
> didn't convert it (empty return). Then I manually copied the rectangle
> and saved it in a new file (see attached Single_Number.png). Then I tried
> this file without the *SetRectangle* call. Now it works.
>
> The attached *Protocol_table.png* is the original image, but I removed
> all other stuff in the picture. So it's empty except for the number at the
> original position. Now I have the following behaviour: in DEBUG mode the
> conversion works, in RELEASE mode it does not.
>
> I also tried to slightly enlarge the rectangle area (see the last
> SetRectangle call in the code below). But then I got a runtime exception.
> The resolution of the picture is 2625x1682. So there should be no buffer
> overflow?!
>
> Am I doing something wrong here? Or what causes this behaviour?
>
> This is my basic code:
>
> // std includes
> #include <cstdio>
> #include <iostream>
>
> // tesseract includes
> #include "tesseract/baseapi.h"
>
> // Leptonica includes
> #include "allheaders.h"
>
> int main()
> {
>     tesseract::TessBaseAPI api;
>     // Initialize tesseract-ocr with English, without specifying the
>     // tessdata path
>     if (api.Init(nullptr, "eng"))
>     {
>         std::cout << "Could not initialize tesseract." << std::endl;
>         return 1;
>     }
>
>     Pix* image =
>         pixRead("D:/projects/cpp/Tesseract-Test/Protocol_Table.png");
>     //Pix* image =
>     //    pixRead("D:/projects/cpp/Tesseract-Test/Single_Number.png");
>     api.SetImage(image);
>     // Restrict recognition to a sub-rectangle of the image
>     // SetRectangle(left, top, width, height)
>     api.SetRectangle(807, 1393, 93, 49);
>     //api.SetRectangle(707, 1293, 193, 149);
>     // Get OCR result
>     char* outText = api.GetUTF8Text();
>     if (outText)
>         printf("OCR output:\n%s", outText);
>
>     // Destroy used objects and release memory
>     delete[] outText;
>     api.End();
>     pixDestroy(&image);
> }
>

Re: [tesseract-ocr] Normalization failed for string

2023-09-14 Thread Zdenko Podobny

unicharset is created automatically (by official training procedure
https://github.com/tesseract-ocr/tesstrain)
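
Since the unicharset is generated from the ground truth, the messages in the quoted log point at the text itself rather than at a missing unicharset entry. The log appears to show two patterns: a zero-width non-joiner (0x200c) directly before the hasanta (0x9cd), as in the ra + ya-phala words listed below, and a sign typed twice in a row (0x981, 0x9be). The following is a minimal C++ sketch for locating such spots in a UTF-8 ground-truth line before training. The function names decodeUtf8 and findSuspects are hypothetical, and the two checks are an interpretation of this particular log, not the rules Tesseract's validator actually applies.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Decode a UTF-8 string into code points. Assumes well-formed input;
// a stray byte is passed through as its own value.
std::vector<char32_t> decodeUtf8(const std::string& s) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        char32_t cp = b;
        std::size_t extra = 0;  // number of continuation bytes to read
        if (b >= 0xF0) { cp = b & 0x07; extra = 3; }
        else if (b >= 0xE0) { cp = b & 0x0F; extra = 2; }
        else if (b >= 0xC0) { cp = b & 0x1F; extra = 1; }
        ++i;
        for (; extra > 0 && i < s.size(); --extra, ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
        out.push_back(cp);
    }
    return out;
}

// Return the code-point indices of sequences resembling those in the log:
// ZWNJ (U+200C) directly before the Bengali hasanta (U+09CD), or a
// chandrabindu (U+0981) / aa-kar (U+09BE) that occurs twice in a row.
std::vector<std::size_t> findSuspects(const std::vector<char32_t>& cps) {
    std::vector<std::size_t> hits;
    for (std::size_t i = 0; i + 1 < cps.size(); ++i) {
        bool joinerBeforeHasanta = cps[i] == 0x200C && cps[i + 1] == 0x09CD;
        bool doubledSign = cps[i] == cps[i + 1] &&
                           (cps[i] == 0x0981 || cps[i] == 0x09BE);
        if (joinerBeforeHasanta || doubledSign) hits.push_back(i);
    }
    return hits;
}
```

Running findSuspects(decodeUtf8(line)) over every line of the ground truth would list the positions to inspect by hand. How to fix each spot (removing the joiner, replacing it, or correcting a doubled sign) is a decision about the text and is not covered by this sketch.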


Zdenko


On Thu, Sep 14, 2023 at 13:56, Ali hussain wrote:

> I ran into this normalization error with my own training text. I think the
> main problem is * ্য* in these words, and I didn't find * ্য* in the
> ben.unicharset file. I think this is the reason for the error.
> If I create a unicharset entry for * ্য* and add it to the ben.unicharset
> file, will it work?
> I don't know how to create a unicharset entry for * ্য*. Look at
> these words to understand better. Thanks.
>
> ব্যাটারির
> র‌্যাবের
> র‌্যাঙ্কিংয়েও
> হ্যাকাররা
>
> *This is the main error.: *
> Extracting unicharset from plain text file data/ben/all-gt
> Invalid start of grapheme sequence:D=0x981
> Normalization failed for string 'পারে মটোরোলার গবেষকেদের তৈরি বিশেষ এ উলকি
> ত্বকের ওপর আঁঁকা এক ধরনের সার্কিটের মতো এতে কোনো ব্যাটারির প্রয়োজন পড়ে না'
> Dropping isolated joiner: 0x200c
> Invalid start of grapheme sequence:H=0x9cd
> Dropping isolated joiner: 0x200c
> Invalid start of grapheme sequence:H=0x9cd
> Normalization failed for string 'হবে এসব স্থানে মোটরসাইকেল নিয়ে ও হেঁটে
> র‌্যাবের দল টহল দেবে র‌্যাবের পোশাকধারী সদস্যের পাশাপাশি সাদা পোশাকে
> গোয়েন্দা'
> Dropping isolated joiner: 0x200c
> Invalid start of grapheme sequence:H=0x9cd
> Normalization failed for string 'র‌্যাবের এক বিজ্ঞপ্তিতে এ তথ্য জানানো হয়
> রমজান মাসে আর্থিক লেনদেন বেড়ে যাওয়ায় ছিনতাই চাঁদাবাজির মতো সন্ত্রাসী
> কর্মকাণ্ড রোধে'
> Dropping isolated joiner: 0x200c
> Invalid start of grapheme sequence:H=0x9cd
> Normalization failed for string 'কার্যক্রম জোরদার করা হবে এ ব্যাপারে
> র‌্যাবের গণমাধ্যম শাখার পরিচালক উইং কমান্ডার এ টি এম হাবিবুর রহমান প্রথম
> আলো'
> Dropping isolated joiner: 0x200c
> Invalid start of grapheme sequence:H=0x9cd
> Normalization failed for string 'বড় ব্যবধানে হারানোর পর এখন বিশ্বকাপ
> জয়ের স্বপ্নে বিভোর ব্রাজিলের সমর্থকেরা ফুটবলে ব্রাজিলিয়ান উত্থানের
> প্রতিধ্বনি শোনা যাচ্ছে ফিফার র‌্যাঙ্কিংয়েও'
> Dropping isolated joiner: 0x200c
> Invalid start of grapheme sequence: H=0x9cd
> Normalization failed for string 'নয় নম্বরে গত বছরের জুলাই থেকে শুরু
> হয়েছিল ফিফা র‌্যাঙ্কিংয়ে ব্রাজিলের অবনমন স্বাগতিক হওয়ার সুবাদে বিশ্বকাপ
> বাছাই পর্ব খেলতে'
> Dropping isolated joiner: 0x200c
> Invalid start of grapheme sequence: H=0x9cd
> Dropping isolated joiner: 0x200c
> Invalid start of grapheme sequence: H=0x9cd
> Normalization failed for string 'এসব পদক্ষেপ নিয়েছে র‌্যাব নিরাপত্তা
> পরিকল্পনার অংশ হিসেবে অন্য আইনশৃঙ্খলা বাহিনীর পাশাপাশি র‌্যাবও নিজস্ব
> দায়িত্বপূর্ণ এলাকায় তিন ধাপে'
> Dropping isolated joiner: 0x200c
> Invalid start of grapheme sequence: H=0x9cd
> Normalization failed for string 'ঠেকাতে র‌্যাবের পদক্ষেপের কথা উল্লেখ করেন
> উইং কমান্ডার হাবিবুর রহমান এ ব্যাপারে তিনি বলেন বাস রেল লঞ্চ কাউন্টার ও'
> Dropping isolated joiner: 0x200c
> Invalid start of grapheme sequence: H=0x9cd
> Dropping isolated joiner: 0x200c
> Invalid start of grapheme sequence: H=0x9cd
> Normalization failed for string 'নিয়ন্ত্রণে রাখতে অন্যবারের মতো এবারের
> রমজান মাসেও দেশজুড়ে বাড়তি নিরাপত্তা ব্যবস্থা নিয়েছে র‌্যাপিড একশন
> ব্যাটালিয়ন র‌্যাব আজ বৃহস্পতিবার'
> Invalid start of grapheme sequence: M=0x9be
> Invalid start of grapheme sequence: M=0x9be
> Invalid start of grapheme sequence: M=0x9be
> Normalization failed for string 'ফিশিং এটাক বলে এ ছাড়া ডিকশনারি এটাক বা
> সহজে অনুমান করা যায় এমন শব্দনির্ভর পাসওয়াার্ডগুলো দিয়েও আক্রমণ করে
> হ্যাকাররা গবেষকেরা'
> Dropping isolated joiner: 0x200c
> Invalid start of grapheme sequence: H=0x9cd
> Normalization failed for string 'ফাইনালে ব্রাজিলের কাছে হেরে কনফেডারেশনস
> কাপের শিরোপাটা অধরা থেকে গেলেও ফিফা র‌্যাঙ্কিংয়ের শীর্ষস্থানটা হারাতে
> হয়নি স্পেনকে ১৫৩২ পয়েন্ট নিয়ে'
>

Re: [tesseract-ocr] Preprocess screenshot image before tesseract.

2023-08-29 Thread Zdenko Podobny

Please do not send compressed images (rar, zip) to the mailing list.
Post them somewhere or use an appropriate image format to decrease their size
(renaming a bmp file to png does not work).

Zdenko


On Tue, Aug 29, 2023 at 9:01, Ajay Pandya wrote:

> Hello Everyone,
>
> Can anyone help me with better pre-processing techniques to apply before
> sending images to Tesseract? I am adding some images here to give a better
> understanding of my problem.
>
> Thanks in advance.
>

Re: [tesseract-ocr] Whitelist is not accepting special characters

2023-08-27 Thread Zdenko Podobny

IMO there is no need to use psm and whitelist:

tesseract text.png - -l fast/script/Latin
Estimating resolution as 274
Ñato ñelo ñaña álca moño

Ñoko niño niña chillňa élif

For Windows I guess there could be a problem with UTF-8 in the terminal...

Zdenko


On Sun, Aug 27, 2023 at 6:25, Shadya S. wrote:

> I'm using Tesseract (version 5.3.1) in Windows to recognize characters
> from a text that includes special characters like ñüá. Most of these
> characters are within the Latin script, so I've declared this in the
> command line.
>
> In this image, the special characters are ñ,Ñ,á,é.
> [image: text.png]
>
> The command line I'm using is
> * tesseract text.png stdout --psm 6 -l Latin -c
> tessedit_char_whitelist=aáeéiocfhklmnñtÑ*
>
> However, the output text is missing white spaces between words, and the
> special characters are being completely ignored, resulting in:
> *aoloaalcalmoo*
> *okonioniachillalif *
>
>
> Do you know why tesseract is not taking into account the characters I've
> declared in the whitelist? Maybe I'm not correctly specifying the special
> characters
>
> Any help is greatly appreciated.
>

Re: [tesseract-ocr] Suggestions for Windows 10 x64 build issue

2023-08-20 Thread Zdenko Podobny

Maybe you should provide a simple test case for replicating the problem
(including information on how you built tesseract & leptonica).

E.g. SetRectangle_test.cpp (from
https://groups.google.com/g/tesseract-ocr/c/PMHq6YSpRRE/m/Z2DCrgQlAAAJ)
links without problems for me:

cl /EHsc SetRectangle_test.cpp /std:c++17 /I F:/win64/include /link
/LIBPATH:F:\win64\lib leptonica-1.84.0.lib tesseract53.lib
Microsoft (R) C/C++ Optimizing Compiler Version 19.29.30147 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

SetRectangle_test.cpp
Microsoft (R) Incremental Linker Version 14.29.30147.0
Copyright (C) Microsoft Corporation.  All rights reserved.

/out:SetRectangle_test.exe
/LIBPATH:F:\win64\lib
leptonica-1.84.0.lib
tesseract53.lib
SetRectangle_test.obj


Zdenko


On Fri, Aug 18, 2023 at 22:55, CraigLandrum wrote:

> Our document scanning/document management app makes use of the tesseract
> library. We have a single .cpp that contains "glue" code to do things like
> clip areas of an image and adjust image depth and resolution before
> handing it off to SetImage and SetRectangle in tesseract.  In version 3.05
> of tesseract, we did this by generating static versions of the leptonica and
> tesseract libraries and then linking them in a VS project with our glue
> code, creating a DLL that contains the linked glue code, tesseract lib, and
> leptonica lib with the tessdata folder in a zip file as a resource which is
> unpacked/unzipped at initialization.  This has worked like a champ from
> version 2.x through 3.05.  I'm now using VS 2022 on Windows 10 x64 and I've
> created the latest leptonica (1.83.1) and tesseract (5.3.1) as static
> libraries, but when I try and link with our glue code, I get a lot (908 to
> be exact) of LNK2005 and other weird errors.  The glue code is C++ and we
> are including tesseract's baseapi.h and a few other header files.  Has
> anyone else tried to do this on Windows 10 x64 and VS 2022? My Windows guru
> thinks it stems from including baseapi.h in both the glue code and the
> tesseract lib. Is there some obscure flag I can set in VS 2022 to tell it
> to ignore duplicately defined symbols (i.e. LNK2005)?  FYI, I did this
> successfully on a 64-bit Mac M1 machine with no problem, so I suspect it is
> simply my ignorance of the VS 2022 options that is my issue on Windows.
>

Re: [tesseract-ocr] Question reg. Telugu ; char missing in ocr ; how to fix ?

2023-08-17 Thread Zdenko Podobny

Please provide details of what you are doing (including the Tesseract
version, OS, and which tessdata you used...).

Make sure you read the tesseract documentation, and please also provide details
on which suggested solution you used and which char is missing (as not
everybody is familiar with Telugu).

Zdenko


On Fri, Aug 11, 2023 at 19:07, ravi kumar wrote:

> Hi,
> I am new to this program and not sure how and where to start fixing this.
> I have attached an image that is used for testing Tesseract and an hOCR
> file for tracing the missing char; can someone interpret it and guide me to
> the fix?
>
> TIA,
> Ravi Kumar.
>

Re: [tesseract-ocr] only english language is recoganising

2023-08-17 Thread Zdenko Podobny

We are sorry, but we have no clue what you are doing.
Please provide the details for replicating your problem.

Zdenko


On Sat, Aug 12, 2023 at 20:25, V S KARTHIK wrote:

> Hi,
> Malayalam or any other language is not being extracted from the image. Why?
> Does anybody know?
>

Re: [tesseract-ocr] SetRectangle change?

2023-08-01 Thread Zdenko Podobny

Yes, there is a problem with SetRectangle, or there is a mismatch with
other API functions (e.g. GetThresholdedImage).
It can be demonstrated with the attached simple code.

According to the API [1], SetRectangle(left, *top*, width, height), e.g.
SetRectangle(left, top, width, height * .3), should OCR the first 30% of the
image. Indeed, GetThresholdedImage provides it correctly.
But GetUTF8Text() OCRed the "last" 30% of the image (i.e. it acts like
SetRectangle(left, *bottom*, width, height)).

IMO the safer solution is to use the cropped image for SetImage.
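
For the cropping route, the region still has to be computed in pixels first. A minimal C++ sketch of that step, assuming a top-left origin as in the SetRectangle documentation; the Rect struct and the topFraction helper are hypothetical names introduced here, not part of the Tesseract or Leptonica API.

```cpp
#include <algorithm>

// Plain rectangle in pixel coordinates, origin at the top-left corner.
struct Rect {
    int left, top, width, height;
};

// Hypothetical helper: rectangle covering the top `fraction` of an image
// (e.g. 0.3 for the top 30%). The fraction is clamped to [0, 1] so the
// rectangle can never extend past the image.
Rect topFraction(int imgWidth, int imgHeight, double fraction) {
    double f = std::min(1.0, std::max(0.0, fraction));
    int h = static_cast<int>(imgHeight * f);
    return Rect{0, 0, imgWidth, h};
}
```

With Leptonica, the resulting values could then be passed to boxCreate and pixClipRectangle to produce the cropped Pix that goes into SetImage; that part is left out here so the sketch stays independent of both libraries.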

[1]
https://github.com/tesseract-ocr/tesseract/blob/0768e4ff4c21aaf0b9beb297e6bb79ad8cb301b0/include/tesseract/baseapi.h#L340


Zdenko


On Tue, Aug 1, 2023 at 20:40, CraigLandrum wrote:

> We use tesseract in our document imaging app - first started with version
> 2.x and recently upgraded from 3.05 to 5.3.1, and something broke.  We
> supply images to tesseract using SetImage and then SetRectangle.  In one of
> our apps, we often OCR the top third of invoices to gather info on a
> vendor.  This worked fine in 3.05 but not in 5.3.1.  If I specify the full
> image dimensions in SetRectangle (as provided to SetImage), all works fine,
> but if I specify dimensions in SetRectangle to just do the top third of the
> image, I get total garbage back. We are providing one-bit B&W images to
> SetImage (white = 1)and specify the target area in pixels. Something
> changed between 3.05 and 5.3.1 to make this not work.  Is there something I
> missed in the interim?  Perhaps SetRectangle(x,y,w,h) wants dimensions that
> start on 8-bit bounds or something equally restrictive?  Any suggestions
> welcome.
>

/*
invoice.png -> 
https://images.ctfassets.net/lzny33ho1g45/5HzGPfsoZo3g7klt0Aww6X/89adc1672b7872667eb5f781adeccfac/fcb74faee4c0576ceaacf82777f6bc93__1_.png?w=1400
*/

#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>

int main() {

  tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
  api->Init(NULL, "eng");

  Pix *image = pixRead("invoice.png");
  api->SetImage(image);
  int w = pixGetWidth(image);
  int h =  pixGetHeight(image);
  // int h_adj = h - h * .7;
  int h_adj = h * .3;
  api->SetRectangle(0, 0, w, h_adj);
  char *outTextSR = api->GetUTF8Text();
  printf("\tOCR output after SetRectangle:\n%s", outTextSR);
  Pix *rect_pix = api->GetThresholdedImage();
  pixWrite("ocred_pix.png", rect_pix, IFF_PNG);

  api->SetImage(rect_pix);
  char *outTextSI = api->GetUTF8Text();
  printf("\n\tOCR output SetImage:\n%s", outTextSI);

  api->End();
  pixDestroy(&image);
  pixDestroy(&rect_pix);
  delete[] outTextSR;
  delete[] outTextSI;
  delete api;
  return 0;
}

Re: [tesseract-ocr] Tesseract-ocr in quiet mode

2023-07-23 Thread Zdenko Podobny

It is not a Tesseract problem but a VB one. Proof of this can be found in
pytesseract, which calls the tesseract executable without a console window.

Zdenko


On Sun, Jul 23, 2023 at 15:55, nor s wrote:

> Is there a way to have Tesseract run without producing a DOS window? I'm
> incorporating a call to Tesseract-ocr in my VB.net application to read some
> date info from an image. Each time I execute Tesseract I get a DOS window
> popping up. I'm on Windows 10 and Tesseract 5.0.
> Thanks
>  Nor
>

Re: [tesseract-ocr] missing tesseract_opencl_profile_devices.dat (or how to disable OpenCL)

2023-07-16 Thread Zdenko Podobny

It is incompetent and irresponsible to use experimental code in
production/distribution.


Zdenko


On Sun, 16 Jul 2023 at 21:13, Markus Leuthold wrote:

> It looks like OpenSuse TW builds the package with "--enable-opencl"
>
> https://build.opensuse.org/package/view_file/openSUSE:Factory/tesseract-ocr/tesseract-ocr.spec
> Would you consider this a bad choice and prefer to build without opencl
> for a large distribution such as OpenSuse?
>
>
> On Sunday, 16 July 2023 at 19:05:20 UTC+2 zdenop wrote:
>
>> OpenCL cannot be disabled at run time.
>> It is disabled by default, marked as experimental, and not recommended
>> on the forum, the issue tracker, etc.
>> It exists (as a compile-time option) only as a starting point for
>> potential developers.
>>
>> Zdenko
>>
>>
>> On Sun, 16 Jul 2023 at 17:21, Markus Leuthold wrote:
>>
>>> I'm using tesseract-ocr built by OpenSuse TW. For each OCR run, an error
>>> appears about a missing profile file, then a new file
>>> tesseract_opencl_profile_devices.dat is created. The 2nd run with the
>>> profile file is then successful. How do I disable OpenCL entirely (at run
>>> time, not at compile time) or fix the issue with the profile file?
>>>
