Re: [tesseract-ocr] Tesseract 5 with dnf

2024-09-05 Thread Zdenko Podobny
No. We do not distribute binary packages. Volunteers create and maintain
them.

Zdenko


Thu, 5 Sep 2024 at 20:56, Chris Crutts (agentc313) wrote:

> on my Oracle Linux 8.10 distribution, doing
>
> $ sudo dnf install tesseract
>
> installs tesseract version 4.1.1-2.el8 and leptonica version 1.76.0-2.el8
>
> As of today, 9/5/2024, the newest version is Release 5.4.1 ·
> tesseract-ocr/tesseract (github.com)
> <https://github.com/tesseract-ocr/tesseract/releases/tag/5.4.1>
>
> I am curious as to why the newest version able to be installed via dnf is 
> Release
> 4.1.1 <https://github.com/tesseract-ocr/tesseract/releases/tag/4.1.1> which
> was released late 2019.
>
> I found that you can install from source, or by using the Snap Store
> <https://snapcraft.io/install/tesseract/rhel>, but I want to use dnf.
>
> Are there any plans to update the dnf package in the future?
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/7db00879-c247-4065-b5d8-e8220d84826cn%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/7db00879-c247-4065-b5d8-e8220d84826cn%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>



[tesseract-ocr] Tesseract 5 with dnf

2024-09-05 Thread Chris Crutts (agentc313)
on my Oracle Linux 8.10 distribution, doing

$ sudo dnf install tesseract

installs tesseract version 4.1.1-2.el8 and leptonica version 1.76.0-2.el8

As of today, 9/5/2024, the newest version is Release 5.4.1 · 
tesseract-ocr/tesseract (github.com) 
<https://github.com/tesseract-ocr/tesseract/releases/tag/5.4.1>

I am curious as to why the newest version able to be installed via dnf is 
Release 
4.1.1 <https://github.com/tesseract-ocr/tesseract/releases/tag/4.1.1> which 
was released late 2019.

I found that you can install from source, or by using the Snap Store 
<https://snapcraft.io/install/tesseract/rhel>, but I want to use dnf.

Are there any plans to update the dnf package in the future?



[tesseract-ocr] Tesseract not working for some single examples.

2024-07-30 Thread Filip Bry


I'm trying to use Tesseract in a project written in C#. I have a problem 
reading text from part of an image. I'm trying to find these 4 characters 
(in the example) and the number after "e". For some examples it works 
perfectly, but for others it prints "Empty page!!!". The only difference 
between the examples is the background color; the image processing is the 
same on every try. What should I do to minimize the probability of error?
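
Since the two samples reportedly differ only in background color, one thing that may be worth checking is whether the failing crop reaches Tesseract as light text on a dark background; the Tesseract quality notes recommend dark text on a light background for the LSTM engine. The following is a minimal Python sketch of that check (not the poster's C# pipeline). The helper name `normalize_polarity`, the list-of-rows image layout and the fixed threshold of 128 are assumptions made for illustration.

```python
def normalize_polarity(gray, threshold=128):
    """Return a copy of `gray` (rows of 0-255 ints) with a light background.

    If most pixels are darker than `threshold`, the image is assumed to be
    light text on a dark background and is inverted.
    """
    pixels = [p for row in gray for p in row]
    dark = sum(1 for p in pixels if p < threshold)
    if dark > len(pixels) - dark:
        return [[255 - p for p in row] for row in gray]
    return [list(row) for row in gray]


# A 3x4 "image": dark background (20) with one bright stroke (230).
sample = [
    [20, 20, 20, 20],
    [20, 230, 230, 20],
    [20, 20, 20, 20],
]
print(normalize_polarity(sample)[1])  # [235, 25, 25, 235]
```

In a real pipeline the same decision would be made on the cropped region just before it is handed to the OCR engine.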


This is the image where OCR works correctly:
[image: working.jpg]

and here it does not work: 

[image: not working.jpg]



Part of the code in C#:


using System;
using System.Drawing;
using OpenCvSharp;
using Tesseract;

public static class Sign
{
    public static void Verify()
    {
        string imagePath = "path.bmp";
        Mat imageSign = new Mat(imagePath);

        // Crop the region of interest: 1%..30% of the width, 60%..90% of the height.
        int h = imageSign.Rows;
        int w = imageSign.Cols;
        int point1 = (int)(0.01 * w);
        int point2 = (int)(0.6 * h);
        int point3 = (int)(0.3 * w);
        int point4 = (int)(0.9 * h);
        imageSign = new Mat(imageSign, new OpenCvSharp.Rect(point1, point2,
            point3 - point1, point4 - point2));

        // Upscale the crop 2x and overwrite the file on disk.
        Cv2.Resize(imageSign, imageSign, new OpenCvSharp.Size(), 2, 2);
        imageSign.SaveImage(imagePath);

        // Re-save the crop with 300 DPI metadata.
        string imagePathA = "2nd image path.bmp";
        using (Bitmap bitmap = (Bitmap)Image.FromFile(imagePath))
        using (Bitmap newBitmap = new Bitmap(bitmap))
        {
            newBitmap.SetResolution(300, 300);
            newBitmap.Save(imagePathA);
        }

        using (var pixFromFile = Pix.LoadFromFile(imagePathA))
        using (var engine = new TesseractEngine(
            @"C:\Program Files\Tesseract-OCR\tessdata", "eng", EngineMode.Default))
        {
            engine.SetVariable("tessedit_char_whitelist", "0123456789");
            // "--psm 10" corresponds to PageSegMode.SingleChar,
            // "--oem 3" to EngineMode.Default.
            using (var page = engine.Process(pixFromFile, PageSegMode.SingleChar))
            {
                string text = page.GetText();
                Console.Write(text);

                string[] lines = text.Split('\n');
                bool linijka = false; // "linijka" = "line"; set when a 4-5 character line is found

                foreach (string line in lines)
                {
                    if (line.Length == 4 || line.Length == 5)
                    {
                        Console.WriteLine("Oznaczenie e5: "); // "Marking e5:"
                        Console.WriteLine(line);
                        linijka = true;
                    }
                    if (line.Length == 1)
                    {
                        Console.WriteLine("e_:");
                        Console.WriteLine(line);
                    }
                }
            }
        }

        Cv2.ImShow("koniec", imageSign); // "koniec" = "end"
        Cv2.WaitKey(0);
    }
}

I tried cropping the image differently, and for some reason making the crop 
bigger or smaller than it is now makes the results worse. I also tried 
some other Tesseract PSM configurations and changed the DPI of the image to 
300.



[tesseract-ocr] Tesseract training ground truth: I'm confused about the box files

2024-07-10 Thread Mateusz Matela
Hi all,

Sorry if double posting, my previous message didn't appear and I don't see 
any info about waiting for acceptance or something.
I was searching for this topic in this forum and it was mentioned a few 
times, but I couldn't find a clear and definitive explanation.

How does the information put in the .box files affect the training process? 
The file contains coordinates for each character in the txt file, but the 
documentation says that since Tesseract 4.0 the model operates on the level 
of whole lines. Some tools like text2image generate the .box files with 
accurate coordinates for each character. When the .box files are missing 
the tesstrain Makefile generates them using generate_line_box.py, which 
assigns the same full image area to each character.

I see 3 possible conclusions, which one is closest to the truth?

1. The .box files do not affect the LSTM training at all and are just a 
leftover from the times of Tesseract 3. In that case, ideally in the future 
they could be completely dropped or only required/generated when 
specifically working with the legacy engine.

2. There is still a chance that training will work better with exact 
coordinates and the generate_line_box.py is just a cheap workaround that 
could be improved on in the future.

3. The .box file is still important in case you prefer to define the 
coordinates for the text in the image instead of cropping the image. The 
granularity of the coordinates is not important, as Tesseract will just work 
on a box that encapsulates all of the character boxes. Even if confusing, 
this approach is still better than having different .box file formats for 
LSTM and the legacy engine.

I'll be grateful for any wisdom on this.

Thanks
Mateusz
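
For readers following along, the behaviour described above (every character of a line receiving the same full-image box) can be sketched as follows. This only illustrates the idea and is not the actual generate_line_box.py code: the function name `line_box_entries` is hypothetical, and the real script may emit additional marker lines. The layout `<symbol> <left> <bottom> <right> <top> <page>` is the usual .box file format.

```python
def line_box_entries(text, width, height, page=0):
    """Build .box-style lines where each character spans the whole line image."""
    return [f"{ch} 0 0 {width} {height} {page}" for ch in text]


for entry in line_box_entries("Ab1", width=640, height=48):
    print(entry)
# A 0 0 640 48 0
# b 0 0 640 48 0
# 1 0 0 640 48 0
```

Seen this way, the per-character coordinates carry no information beyond the extent of the line image, which is the situation the three hypotheses above try to interpret.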



Re: [tesseract-ocr] Tesseract Output not correct in hindi text.

2024-06-26 Thread Ger Hobbelt
If you want more speed, give tesseract less to work on. Your scenario
sounds like you will have a large number of PDFs, all containing the same
(scanned) form. From the look of this sample, it seems page alignment, etc.
has already been taken care of, so we can assume that all those forms
(scans) would look the same if we stacked them on top of one another, i.e.
the data you are looking for is to be found at predetermined fixed
rectangle coordinates within the page.
Create a mask that erases everything else to white, so only the fields of
interest remain and feed that to tesseract. Output TSV or HOCR to get
coordinates alongside the OCRed text and you can reconstruct the fields'
content easily. At least that's the assumption here & now.
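
As a rough illustration of the "output TSV to get coordinates" step, here is a minimal Python sketch that reads Tesseract's TSV output and keeps the word-level rows. The function name `words_from_tsv` and the sample rows are made up for this example; the column names are the ones Tesseract writes in its TSV header.

```python
import csv
import io


def words_from_tsv(tsv_text):
    """Yield (text, left, top, width, height) for each recognized word."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    for row in reader:
        text = (row.get("text") or "").strip()
        # Level 5 rows are words; page/block/paragraph/line rows carry no text.
        if row["level"] == "5" and text:
            yield (text, int(row["left"]), int(row["top"]),
                   int(row["width"]), int(row["height"]))


# Made-up sample in the TSV layout (header + one page row + two word rows).
sample_tsv = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
    "left\ttop\twidth\theight\tconf\ttext\n"
    "1\t1\t0\t0\t0\t0\t0\t0\t800\t600\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t120\t40\t90\t22\t96.5\tName:\n"
    "5\t1\t1\t1\t1\t2\t220\t40\t70\t22\t91.2\tSuman\n"
)
print(list(words_from_tsv(sample_tsv)))
# [('Name:', 120, 40, 90, 22), ('Suman', 220, 40, 70, 22)]
```

Each word can then be assigned to a form field by testing whether its box falls inside that field's known rectangle.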

The key is: *image preprocessing*
In your case, there's a lot that can be done in that preprocessing stage so
that tesseract has only a few text areas to process in an otherwise white
page.
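
The masking idea can be sketched in a few lines of Python. This is a simplified illustration on a list-of-rows grayscale image rather than a real image library; the function name `mask_outside` and the rectangle layout `(left, top, width, height)` are assumptions for the example.

```python
def mask_outside(gray, keep_rects, white=255):
    """Return a copy of `gray` with everything outside `keep_rects` set to white.

    `gray` is a list of rows of 0-255 ints; each rect is (left, top, width, height).
    """
    height = len(gray)
    width = len(gray[0]) if gray else 0
    out = [[white] * width for _ in range(height)]
    for left, top, w, h in keep_rects:
        # Copy only the pixels inside the rectangle, clipped to the image.
        for y in range(max(top, 0), min(top + h, height)):
            for x in range(max(left, 0), min(left + w, width)):
                out[y][x] = gray[y][x]
    return out


page = [[0] * 6 for _ in range(4)]  # an all-black 6x4 "page"
for row in mask_outside(page, [(1, 1, 2, 2)]):
    print(row)
# [255, 255, 255, 255, 255, 255]
# [255, 0, 0, 255, 255, 255]
# [255, 0, 0, 255, 255, 255]
# [255, 255, 255, 255, 255, 255]
```

With an image library the same effect is usually achieved by painting a white canvas and pasting the field regions onto it.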

*Reference material: read it all, as a lot depends on context and you are
the one who can determine whether each item is applicable / may have an
effect in your particular scenario.*

- https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html
- image scaling can have a significant impact; see
https://groups.google.com/g/tesseract-ocr/c/Wdh_JJwnw94/m/24JHDYQbBQAJ
- https://groups.google.com/g/tesseract-ocr/c/B2-EVXPLovQ/m/lP0zQVApAAAJ
(process flow)



Met vriendelijke groeten / Best regards,

Ger Hobbelt

--
web:http://www.hobbelt.com/
http://www.hebbut.net/
mail:   g...@hobbelt.com
mobile: +31-6-11 120 978
--


On Wed, Jun 26, 2024 at 8:42 AM lalit joshi  wrote:

> I am trying to build an app where I have to extract some data from PDFs
> containing election roll data for the Indian constituencies. I have
> attached a sample PDF. Below is the code I am running:
>
> import cv2
> import numpy as np
> import pdf2image
> import pytesseract
>
> # Note: kernel_sharpening is used below but its definition was not posted.
> data = []
> # Render page 3 of the PDF at 300 DPI.
> current_page = np.array(pdf2image.convert_from_path(
>     '/home/spxlpt087/Downloads/New folder/2024-FC-EROLLGEN-S07-49-FinalRoll-Revision2-HIN-61-WI.pdf',
>     first_page=3,
>     last_page=3,
>     dpi=300)[0])
> sharpened_image = cv2.filter2D(current_page, -1, kernel_sharpening)
> kernel = np.ones((1, 1), np.uint8)
> img_dilation = cv2.dilate(sharpened_image, kernel, iterations=5)
> gray_img = cv2.cvtColor(img_dilation, cv2.COLOR_BGR2GRAY)
> thr = cv2.threshold(gray_img, 128, 255, cv2.THRESH_BINARY_INV)[1]
> # findContours returns 2 or 3 values depending on the OpenCV version.
> cnts = cv2.findContours(thr, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
> cnts = cnts[0] if len(cnts) == 2 else cnts[1]
> contours_new = ()
> cnts_tables = [cnt for cnt in cnts if cv2.contourArea(cnt) > 5]
> # Sort the bounding boxes top-to-bottom, then left-to-right.
> rects = sorted([cv2.boundingRect(cnt) for cnt in cnts_tables],
>                key=lambda r: (r[1], r[0]))
> # OCR each cell separately and collect the text.
> for i_r, (x, y, w, h) in enumerate(rects, start=1):
>     cell = current_page[y+1:y+h-1, x+1:x+w-1]
>     text = pytesseract.image_to_string(
>         cell,
>         config='--oem 3 --psm 11',  #--oem 1 --psm 3
>         lang='Devanagari+eng',
>         nice=1)
>     text = text.replace('\f', '')
>     text = text.replace('\n\n', '')
>     print(text)
>     data.append(text)
>
>
> The data I am getting :-
> SOI0798389iनाम: उघापिता का नामः जगदीश चंद्र~मकान नं. : 001आयु : 33 लिंग :
> महिला $011000827[|नाम : सुरेश कुमारपिता का नाम: राजबीर~मकान नं, : 01आयु :
> 21 लिंग : पुरुष MXMI5749203नाम : अशोक कुमारपिता का नाम: रुपचन्दफोटो
> उपलब्धमकान नं, : 79आयु : 39 लिंग : पुरुष $011142009नाम : प्रकाश कौरपति का
> नामः राम लुभायामकान नं. : 10~आयु : 49 लिंग : महिला [ 3]$011145184नाम :
> मोनिकापति का नामः धर्मवीरमकान नं. : 11फोटो उपलब्धआयु : 27 लिंग : महिला
> $011146356नाम : अरसियापिता का नाम: राजकुमारमकान नं. : 11फोटो उपलब्ध हैआयु :
> 18 लिंग : महिल SOI07983637नाम : सुनीलपिता का नामः रामदिया~मकान न॑. : 17आयु
> : 24 लिंग : पुरुष $011146208[9नाम : सुमनपति का नामः सुभाषफोटो उपलब्धमकान
> न॑. : 18आयु : 31 लिंग : महिला $011146133|नाम : सुभाषपिता का नाम: राज
> कुमार~मकान न॑. : 18आयु : 33 लिंग : पुरुष | 10] 0$011141548नाम :
> वीरेंद्रपिता का नाम: रामफलमकान नं. : 19~आयु : 28 लिंग : पुरुष [dl
> 1$011146257नाम : रोहितपिता का नाम: दीपकमकान नं. : 19फोटो उपलब्धंआयु : 20
> लिंग : पुरुष | 12$010958629नाम : सुमनपति का नाम: रामदियामकान नं. : 34फोटो
> उपलब्धआयु : 33 लिंग : महिला SO1092028013 |नाम : वंशीकापिता का नामः
> जितेंद्र~मकान न॑. : 35आयु : 22 लिंग : महिला $011145994नाम : वीरेन्द्रपिता
> का नाम: लक्ष्मण दास~मकान न॑. : 37आयु : 29 लिंग : पुरुष | 15$011141563नाम :
> सोनी कुमारीपिता का नाम: लक्ष्मणमकान न॑. : 37फोटो उपलब्धआयु : 21 लिंग :
> महिला $011143296[ 1a]नाम : सवितापति का नामः जयबीरमकान नं. : 45~आयु : 39
> लिंग : महिला | 7]$011152537नाम : ज्योतिपति का नामः प्रवीनमकान नं. : 45~आयु
> : 27 लिंग : महिला $011164177a)नाम : संजयपिता का नाम: रामदियामकान नं. :
> 50फोटो उपलब्धआयु : 22 लिंग : पुरुष 19 |$011152560नाम: मीनूपति का नामः
> सोनूमकान नं. : 52~आयु : 24 लिं

[tesseract-ocr] Tesseract

2024-06-11 Thread Ahmed Khalid
I have a problem: Tesseract sometimes reads only the second line and misses 
the first one. How can I handle that?



[tesseract-ocr] Tesseract arabic numbers

2024-05-23 Thread Ahmed Khalid
I want to use Tesseract to do OCR on Arabic data. This is my first vision 
project, so I would appreciate guidance on what I should do, and on whether 
I should fine-tune a model or not, and how. 
Thanks in advance



[tesseract-ocr] Tesseract fine tuning questions

2024-05-11 Thread Antonio Jimeno Yepes


  Hi,

  We are interested in improving the performance of Tesseract and we have 
prepared a large set with over 11k pages annotated manually with text line 
bounding boxes and the transcribed text. We have been evaluating fine-tuning 
Tesseract with this set and we observed a slight decrease in performance. 
We would like to identify the issue and run the fine-tuning again. We have 
some questions about the process and we would be grateful if you could help 
us understand the fine-tuning process for Tesseract.

  We have done several tests to fine-tune Tesseract using this set with 
mixed results. We evaluate the performance against an existing benchmark 
that we call the mini-holistic set. The metrics that we consider are 
Levenshtein distance and % of missing words (which considers unique words). 
Using our manually annotated set we obtain a similar Levenshtein distance 
(probably not statistically different) but we get a higher % of missing 
words, e.g. from 7% to over 9.6%.
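
For concreteness, the two metrics named above could be computed along these lines. This is a minimal sketch and not the poster's evaluation code: the exact definition of "% of missing words" is not given in the message, so the lowercasing and the `\w+` tokenization here are assumptions, as are the function names.

```python
import re


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def missing_word_rate(reference, hypothesis):
    """Percentage of unique reference words absent from the hypothesis."""
    ref_words = set(re.findall(r"\w+", reference.lower()))
    hyp_words = set(re.findall(r"\w+", hypothesis.lower()))
    if not ref_words:
        return 0.0
    return 100.0 * len(ref_words - hyp_words) / len(ref_words)


print(levenshtein("kitten", "sitting"))  # 3
print(missing_word_rate("the quick brown fox", "the quick brown f0x"))  # 25.0
```

Note how a single character error changes the edit distance by one but removes a whole unique word, so the two metrics can move differently, as in the results reported here.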

   1. We realized our fine-tuned model degraded performance on scanned 
   documents, so we used PIL to add noise to the preprocessed bounding boxes 
   and trained on them together with the high-quality data. We added noise 
   with a random combination of rescaling, rotation, blur and 
   salt-and-pepper noise. 

Our results were mixed; we saw significant improvement in some files while 
others got a lot worse. Documents with tables and documents that do not 
seem to be scanned saw an improvement in the evaluation metrics. On scanned 
documents, the fine-tuned model seemed to perform worse. The polarization 
effect was greater compared to training with just high-quality data.

   - Is the way we do augmentation correct? 
   - What can potentially cause this kind of mixed results? 
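
To make the discussion of augmentation concrete, here is a small Python sketch of one of the perturbations listed above, salt-and-pepper noise, written so that it is reproducible from a seed. It does not reproduce the poster's PIL pipeline; the function name, the list-of-rows image layout and the `amount` parameter are assumptions for illustration.

```python
import random


def salt_and_pepper(gray, amount=0.05, seed=None):
    """Return a copy of `gray` with roughly `amount` of pixels set to 0 or 255."""
    rng = random.Random(seed)
    out = [list(row) for row in gray]
    for row in out:
        for x in range(len(row)):
            # Each pixel is independently replaced with probability `amount`.
            if rng.random() < amount:
                row[x] = rng.choice((0, 255))
    return out


clean = [[128] * 20 for _ in range(10)]
noisy = salt_and_pepper(clean, amount=0.1, seed=42)
changed = sum(p != 128 for row in noisy for p in row)
print(f"{changed} of 200 pixels altered")
```

Fixing the seed and logging the noise level per sample makes it possible to check afterwards whether the files that got worse correlate with a particular perturbation strength.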


   2. We have tried different parameters. One of them is 
   perfect_sample_delay, with different values from 1 to 100, to reduce the 
   impact of examples for which Tesseract had a perfect output. 

We find that this parameter has no impact: the BCER is similar to other 
experiments without this parameter.

   - Is our understanding of this parameter correct? 
   - Why might we not see any impact when using this parameter? 


   3. We have tried splitting the set into examples for which Tesseract 5 
   has a perfect output (a) and examples for which it fails to produce a 
   perfect output (b). 

We find that the (a) set obtains a low BCER 0.042 during training, while 
(b) gets ~6% BCER, but the performance in Levenshtein distance and % of 
missing words is similar to previous output with both (a) and (b).

   - Benchmark performance is similar despite the different training BCER. 
   Why do you think this is the case? 
   - In this case, the cases that are correct initially with Tesseract 
   should have no or limited impact in the training. 
   - Using the perfect_sample_delay might prevent any learning from 
   happening since all the examples are initially perfect in the (a) set (we 
   checked values from 1 to 100). Why do we see no impact? How would you 
   recommend logging that this parameter is working as expected? 

  Thank you in advance for your help,
  Antonio



[tesseract-ocr] Tesseract for passport data

2024-05-04 Thread Shakhzodbek Rakhmatov
Hi, I am building an automation bot for Telegram that reads user data 
from passports and converts it into a text string. I tried to use 
Tesseract, but it returns gibberish strings and symbols. I would like to 
know whether there is a better way to optimize it and make this bot better. 
Thanks for any responses.



[tesseract-ocr] Tesseract OCR for analysing hand-written exams papers

2024-04-30 Thread Oscar Gledel
Hi,

I've come here after quite a few attempts and tests with tesseract as part 
of a university study project in France. The aim of this project is to 
analyse exam papers written by students in order to facilitate marking.
Our teacher wanted an open-source OCR tool, so we turned to Tesseract.

Although we've made a few attempts, I'd like your opinion on the use and 
training of Tesseract for handwritten text, more precisely on 
single-digit images. So far, we have tested the following:

- Tested fra and eng traineddata on MNIST: *< 65 % precision* (with PSM 10 
and PSM 13)
- Trained Tesseract on the MNIST dataset (fine-tuning only, because training 
from scratch did not work): *< 30 % precision*
- Tested fra and eng traineddata on our custom images (made from students' 
exam papers): *< 50 %*

We are not specifically looking for high precision rates; 85 - 90 % would be 
enough because we compare the results with a database of student IDs.
Here are our questions:

- Is it possible to reach a higher precision rate on handwritten text, and 
how?
- Are there existing models trained for handwriting recognition?
- Are there existing models trained for digit recognition only? 
Otherwise, is it possible to make Tesseract recognize only digits (and so 
get only digits from the *getBestLSTMSymbolChoices()* function)?
- What does the confidence value returned by Tesseract correspond to?
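
The comparison against a database of student IDs mentioned above is what makes a moderate per-digit accuracy workable, and it can be sketched as follows. This is only an illustration: the function names, the `max_errors` tolerance and the sample IDs are made up, and substitutions are the only error type considered.

```python
def hamming(a, b):
    """Number of differing positions between two equal-length strings."""
    return sum(ca != cb for ca, cb in zip(a, b))


def match_student_id(ocr_digits, known_ids, max_errors=1):
    """Return the unique known ID within `max_errors` substitutions, else None."""
    candidates = [
        sid for sid in known_ids
        if len(sid) == len(ocr_digits) and hamming(sid, ocr_digits) <= max_errors
    ]
    # An ambiguous or empty match is reported as None so it can be reviewed by hand.
    return candidates[0] if len(candidates) == 1 else None


known = ["20231145", "20231146", "20239981"]
print(match_student_id("20239987", known))  # 20239981
print(match_student_id("20231147", known))  # None (two IDs are equally close)
```

The second call shows the limit of this approach: IDs that differ in a single digit cannot be told apart once one digit is misread.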


Thanks in advance for your help, I hope my english is understandable at 
least !



[tesseract-ocr] Tesseract in Zoho deluge

2024-04-23 Thread Lenin Mariya Joseph
Is it possible to integrate Tesseract into a Zoho Deluge script to extract 
and translate passport data?



Re: [tesseract-ocr] tesseract misleading in 8 and 6

2024-04-18 Thread Zdenko Podobny
Unfortunately, your post is very vague. Unless you provide a detailed
description of what you are doing (step-by-step so we can replicate it),
nobody can help you.


Zdenko


Wed, 17 Apr 2024 at 12:14, Jayrajsinh Zala wrote:

> I trained Tesseract OCR using MATLAB with a specific training data file,
> but I am still getting errors with 8 and 6.
>
> I have attached all the images that I used for training; I am getting
> errors on the same type of images.
>



[tesseract-ocr] tesseract misleading in 8 and 6

2024-04-17 Thread Jayrajsinh Zala
I trained Tesseract OCR using MATLAB with a specific training data file, 
but I am still getting errors with 8 and 6.

I have attached all the images that I used for training; I am getting 
errors on the same type of images.



[tesseract-ocr] Tesseract to recognize images or shapes

2024-04-16 Thread achille sadjang


Hello everyone,

I have a question: is it possible to train Tesseract to recognize images or 
shapes? If so, could someone guide me on how to proceed?



Re: [tesseract-ocr] tesseract is reading passport mrz text from image incorrectly, its identifying <<<<<<<< as kkkk or cccc

2024-01-27 Thread Zdenko Podobny
Well, in this case it works without image processing ;-)

Anyway, mrz is not an "official" Tesseract model, and there are people who
play with it, so it will take some time to search and dig through
their findings/experience/expertise.

Zdenko
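
Since the parser log quoted below complains about a failed check digit, it may help to see how MRZ check digits are computed, so that OCR output can be sanity-checked before it is handed to a parser. The following Python sketch uses the repeating 7-3-1 weighting defined for machine-readable travel documents; the function names are made up for this example, and the two sample fields are taken from the second MRZ line quoted in this thread.

```python
def mrz_char_value(ch):
    """Numeric value of one MRZ character: digits as-is, A-Z as 10-35, '<' as 0."""
    if "0" <= ch <= "9":
        return int(ch)
    if "A" <= ch <= "Z":
        return ord(ch) - ord("A") + 10
    if ch == "<":
        return 0
    raise ValueError(f"invalid MRZ character: {ch!r}")


def mrz_check_digit(field):
    """Check digit computed with the repeating weights 7, 3, 1."""
    weights = (7, 3, 1)
    total = sum(mrz_char_value(ch) * weights[i % 3] for i, ch in enumerate(field))
    return total % 10


# Date fields from the line "7109094F1112315AUT..." quoted in this thread.
print(mrz_check_digit("710909"))  # 4
print(mrz_check_digit("111231"))  # 5
```

Lowercase letters such as the "x", "c" or "k" in the failing outputs are rejected outright, which gives a cheap way to detect a bad read before parsing.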


Sat, 27 Jan 2024 at 12:02, sara waheed wrote:

> if I didn't research how would I know Tesseract needs image processing? I
> am new to OCR and in the learning phase please be kind and help thanks :)
>
> On Saturday, January 27, 2024 at 3:26:40 PM UTC+5 zdenop wrote:
>
>> What about reading docs and a little bit googling?
>>
>> tesseract two-page-passport-mrz-detected.jpeg - --psm 6 -l mrz
>>
>> IDAUT1999<6<<<
>> 7109094F1112315AUT<<<6
>> MUSTERFRAU<>
>>
>> Zdenko
>>
>>
>> Sat, 27 Jan 2024 at 11:19, sara waheed wrote:
>>
>>> I am trying to read the passport mrz string from the image i am using
>>> Tesseract and OpenCV for image processing i have tried three different ways
>>>  none of them worked
>>>
>>> **Attempt 1**
>>> I have this image  when i do ocr on it teseract read as
>>>
>>> IDAUT1999<6<<<
>>> 7109094F1112315AUT<>> MUSTERFRAU<>>
>>> which is incorrect it treats <<< as x or c or k when I use the
>>> `mrz-java` library to read the details from the string it gives the
>>> following error
>>>
>>> [error] Error parsing MRZ string: Failed to parse MRZ MRTD_TD1
>>> IDAUT1999<6<<<
>>> [error] 7109094F1112315AUT<>> [error] MUSTERFRAU<>> [error]  at 24-25,1: Invalid character in MRZ record: x
>>>
>>> **Attempt 2**
>>>
>>> then I converted the image to grayscale and binarized it using `OpenCV`
>>> Here is the below code
>>>
>>> val roiImagePath =
>>> "src/main/resources/ocr/passport/two-page-passport-mrz-detected.jpeg"
>>>
>>> val grayScaleROI = new Mat()
>>>   val roiImage = Imgcodecs.imread(roiImagePath)
>>>   Imgproc.cvtColor(roiImage, grayScaleROI,
>>> Imgproc.COLOR_BGR2GRAY)
>>>   val roiGaryImagePath =
>>> "src/main/resources/ocr/passport/two-page-passport-mrz-detected-gray.jpeg"
>>>
>>>   Imgcodecs.imwrite(roiGaryImagePath, grayScaleROI)
>>>   val binary = new Mat()
>>>   Imgproc.adaptiveThreshold(grayScaleROI, binary, 255,
>>> Imgproc.ADAPTIVE_THRESH_MEAN_C, Imgproc.THRESH_BINARY , 15, 25)
>>>   val roiBinaryImagePath =
>>> "src/main/resources/ocr/passport/two-page-passport-mrz-detected-binary.jpeg"
>>>   Imgcodecs.imwrite(roiBinaryImagePath, binary)
>>>
>>>  val tesseract = new Tesseract()
>>>   tesseract.setDatapath("/usr/share/tesseract-ocr/4.00/tessdata")
>>>   tesseract.setVariable("user_defined_dpi", "600")
>>>   val result = tesseract.doOCR(new File(roiBinaryImagePath))
>>>   val mrzStr = result.replace(" ", "")
>>>   println(s"two page passport mrz string is: "+mrzStr)
>>>
>>> it created the following binary image
>>>
>>> and the code output is
>>> tesseract reads mrz string from the binary image as
>>>
>>> IDAUT1DODD999>> 7AD9D9GF1TEZSISAUTKEKG
>>> MUSTERFRAUSKISOLDEKKK
>>> and `mrz-java` reads the string and generates the following error
>>>
>>> [error] Error parsing MRZ string: Failed to parse MRZ null
>>> IDAUT1DODD999>> [error] 7AD9D9GF1TEZSISAUTKEKG
>>> [error] MUSTERFRAUSKISOLDEKKK
>>> [error]  at 0-0,0: Different row lengths: 0: 29 and 1: 30
>>>
>>> **Attempt 3**
>>>
>>> then I resized the image
>>>
>>> val width = 1000 // Increase width proportionately (adjust based on your needs)
>>>   val height = (width * binary.rows()) / binary.cols() // Maintain aspect ratio
>>>
>>>   val resizedRoiImage = new Mat()
>>>   Imgproc.resize(binary, resizedRoiImage, new Size(width, height),
>>> 0.0, 0.0, Imgproc.INTER_NEAREST)
>>>
>>>   val resizedImageROIPath =
>>>  
>>> "src/main/resources/ocr/passport/two-page-passport-mrz-detected-binary-resized_image.jpg"
>>>   Imgcodecs.imwrite(resizedImageROIPath, resizedRoiImage)
>>>
>>> mrz string read by Tesseract
>>>
>>> TOAUTIIISKhcceddce
>>> FIOPOSAFIFESSISAUTReececeececs
>>> MUSTERFRAUCCKISOLDECKdcddd
>>>
>>> and the error is
>>>
>>> [info] 15:54:04.200 633 [main] MrzParser INFO - Check digit
>>> verification failed for document number: expected 0 but got h
>>> [error] Error parsing MRZ string: Failed to parse MRZ MRTD_TD1
>>> TOAUTIIISKhcceddce
>>> [error] FIOPOSAFIFESSISAUTReececeececs
>>> [error] MUSTERFRAUCCKISOLDECKdcddd
>>> [error]  at 15-16,0: Invalid character in MRZ record: c
>>>
>>>
>>> can anyone please help how I read the text properly also I have tried
>>> one regex to convert c or k back to <<< it did not work either if anyone
>>> can suggest some workaround or any improvement in code please help me with
>>> that thanks
>>>

Re: [tesseract-ocr] tesseract is reading passport mrz text from image incorrectly, its identifying <<<<<<<< as kkkk or cccc

2024-01-27 Thread sara waheed
If I hadn't done some research, how would I know that Tesseract needs image 
processing? I am new to OCR and still learning, so please be kind and help. 
Thanks :)

On Saturday, January 27, 2024 at 3:26:40 PM UTC+5 zdenop wrote:

> What about reading docs and a little bit googling?
>
> tesseract two-page-passport-mrz-detected.jpeg - --psm 6 -l mrz
>
> IDAUT1999<6<<<
> 7109094F1112315AUT<<<6
> MUSTERFRAU<
>
> Zdenko
>
>
> Sat, 27 Jan 2024 at 11:19, sara waheed wrote:
>
>> I am trying to read the passport mrz string from the image i am using 
>> Tesseract and OpenCV for image processing i have tried three different ways 
>>  none of them worked 
>>
>> **Attempt 1**
>> I have this image  when i do ocr on it teseract read as 
>>
>> IDAUT1999<6<<<
>> 7109094F1112315AUT<> MUSTERFRAU<>
>> which is incorrect it treats <<< as x or c or k when I use the `mrz-java` 
>> library to read the details from the string it gives the following error 
>>
>> [error] Error parsing MRZ string: Failed to parse MRZ MRTD_TD1 
>> IDAUT1999<6<<<
>> [error] 7109094F1112315AUT<> [error] MUSTERFRAU<> [error]  at 24-25,1: Invalid character in MRZ record: x
>>
>> **Attempt 2**
>>
>> then I converted the image to grayscale and binarized it using `OpenCV` 
>> Here is the below code 
>>
>> val roiImagePath = 
>> "src/main/resources/ocr/passport/two-page-passport-mrz-detected.jpeg"
>> 
>> val grayScaleROI = new Mat()
>>   val roiImage = Imgcodecs.imread(roiImagePath)
>>   Imgproc.cvtColor(roiImage, grayScaleROI, Imgproc.COLOR_BGR2GRAY)
>>   val roiGaryImagePath = 
>> "src/main/resources/ocr/passport/two-page-passport-mrz-detected-gray.jpeg"
>> 
>>   Imgcodecs.imwrite(roiGaryImagePath, grayScaleROI)
>>   val binary = new Mat()
>>   Imgproc.adaptiveThreshold(grayScaleROI, binary, 255, 
>> Imgproc.ADAPTIVE_THRESH_MEAN_C, Imgproc.THRESH_BINARY , 15, 25)
>>   val roiBinaryImagePath = 
>> "src/main/resources/ocr/passport/two-page-passport-mrz-detected-binary.jpeg"
>>   Imgcodecs.imwrite(roiBinaryImagePath, binary)
>> 
>>  val tesseract = new Tesseract()
>>   tesseract.setDatapath("/usr/share/tesseract-ocr/4.00/tessdata")
>>   tesseract.setVariable("user_defined_dpi", "600")
>>   val result = tesseract.doOCR(new File(roiBinaryImagePath))
>>   val mrzStr = result.replace(" ", "")
>>   println(s"two page passport mrz string is: "+mrzStr)
>>
>> it created the following binary image
>>
>> and the code output is 
>> tesseract reads mrz string from the binary image as 
>>
>> IDAUT1DODD999> 7AD9D9GF1TEZSISAUTKEKG
>> MUSTERFRAUSKISOLDEKKK
>> and `mrz-java` reads the string and generates the following error 
>>
>> [error] Error parsing MRZ string: Failed to parse MRZ null 
>> IDAUT1DODD999> [error] 7AD9D9GF1TEZSISAUTKEKG
>> [error] MUSTERFRAUSKISOLDEKKK
>> [error]  at 0-0,0: Different row lengths: 0: 29 and 1: 30
>>
>> **Attempt 3**
>>
>> then I resized the image 
>>
>> val width = 1000 // Increase width proportionately (adjust based on your needs)
>>   val height = (width * binary.rows()) / binary.cols() // Maintain aspect ratio
>> 
>>   val resizedRoiImage = new Mat()
>>   Imgproc.resize(binary, resizedRoiImage, new Size(width, height), 
>> 0.0, 0.0, Imgproc.INTER_NEAREST)
>> 
>>   val resizedImageROIPath = 
>>  
>> "src/main/resources/ocr/passport/two-page-passport-mrz-detected-binary-resized_image.jpg"
>>   Imgcodecs.imwrite(resizedImageROIPath, resizedRoiImage)
>>
>> mrz string read by Tesseract
>>
>> TOAUTIIISKhcceddce
>> FIOPOSAFIFESSISAUTReececeececs
>> MUSTERFRAUCCKISOLDECKdcddd
>>
>> and the error is 
>>
>> [info] 15:54:04.200 633 [main] MrzParser INFO - Check digit 
>> verification failed for document number: expected 0 but got h
>> [error] Error parsing MRZ string: Failed to parse MRZ MRTD_TD1 
>> TOAUTIIISKhcceddce
>> [error] FIOPOSAFIFESSISAUTReececeececs
>> [error] MUSTERFRAUCCKISOLDECKdcddd
>> [error]  at 15-16,0: Invalid character in MRZ record: c
>>
>>   
>> Can anyone please help me read the text properly? I also tried a regex 
>> to convert c or k back to <<<, but that did not work either. If anyone 
>> can suggest a workaround or an improvement to the code, I would 
>> appreciate it. Thanks.
>>

Re: [tesseract-ocr] tesseract is reading passport mrz text from image incorrectly, its identifying <<<<<<<< as kkkk or cccc

2024-01-27 Thread Zdenko Podobny
What about reading docs and a little bit googling?

tesseract two-page-passport-mrz-detected.jpeg - --psm 6 -l mrz

IDAUT1999<6<<<
7109094F1112315AUT<<<6
MUSTERFRAU< napísal(a):

> I am trying to read the passport MRZ string from an image. I am using
> Tesseract and OpenCV for image processing. I have tried three different
> approaches and none of them worked.
>
> **Attempt 1**
> I have this image. When I run OCR on it, Tesseract reads it as
>
> IDAUT1999<6<<<
> 7109094F1112315AUT< MUSTERFRAU<
> which is incorrect: it treats <<< as x, c or k. When I use the `mrz-java`
> library to read the details from the string, it gives the following error:
>
> [error] Error parsing MRZ string: Failed to parse MRZ MRTD_TD1
> IDAUT1999<6<<<
> [error] 7109094F1112315AUT< [error] MUSTERFRAU< [error]  at 24-25,1: Invalid character in MRZ record: x
>
> **Attempt 2**
>
> Then I converted the image to grayscale and binarized it using `OpenCV`.
> Here is the code:
>
> val roiImagePath =
> "src/main/resources/ocr/passport/two-page-passport-mrz-detected.jpeg"
>
> val grayScaleROI = new Mat()
>   val roiImage = Imgcodecs.imread(roiImagePath)
>   Imgproc.cvtColor(roiImage, grayScaleROI, Imgproc.COLOR_BGR2GRAY)
>   val roiGaryImagePath =
> "src/main/resources/ocr/passport/two-page-passport-mrz-detected-gray.jpeg"
>
>   Imgcodecs.imwrite(roiGaryImagePath, grayScaleROI)
>   val binary = new Mat()
>   Imgproc.adaptiveThreshold(grayScaleROI, binary, 255,
> Imgproc.ADAPTIVE_THRESH_MEAN_C, Imgproc.THRESH_BINARY , 15, 25)
>   val roiBinaryImagePath =
> "src/main/resources/ocr/passport/two-page-passport-mrz-detected-binary.jpeg"
>   Imgcodecs.imwrite(roiBinaryImagePath, binary)
>
>  val tesseract = new Tesseract()
>   tesseract.setDatapath("/usr/share/tesseract-ocr/4.00/tessdata")
>   tesseract.setVariable("user_defined_dpi", "600")
>   val result = tesseract.doOCR(new File(roiBinaryImagePath))
>   val mrzStr = result.replace(" ", "")
>   println(s"two page passport mrz string is: "+mrzStr)
>
> it created the following binary image
>
> and the code output is
> tesseract reads mrz string from the binary image as
>
> IDAUT1DODD999 7AD9D9GF1TEZSISAUTKEKG
> MUSTERFRAUSKISOLDEKKK
> and `mrz-java` reads the string and generates the following error
>
> [error] Error parsing MRZ string: Failed to parse MRZ null
> IDAUT1DODD999 [error] 7AD9D9GF1TEZSISAUTKEKG
> [error] MUSTERFRAUSKISOLDEKKK
> [error]  at 0-0,0: Different row lengths: 0: 29 and 1: 30
>
> **Attempt 3**
>
> then I resized the image
>
>   // Increase width proportionately (adjust based on your needs)
>   val width = 1000
>   // Maintain aspect ratio
>   val height = (width * binary.rows()) / binary.cols()
>
>   val resizedRoiImage = new Mat()
>   Imgproc.resize(binary, resizedRoiImage, new Size(width, height),
> 0.0, 0.0, Imgproc.INTER_NEAREST)
>
>   val resizedImageROIPath =
>  
> "src/main/resources/ocr/passport/two-page-passport-mrz-detected-binary-resized_image.jpg"
>   Imgcodecs.imwrite(resizedImageROIPath, resizedRoiImage)
>
> mrz string read by Tesseract
>
> TOAUTIIISKhcceddce
> FIOPOSAFIFESSISAUTReececeececs
> MUSTERFRAUCCKISOLDECKdcddd
>
> and the error is
>
> [info] 15:54:04.200 633 [main] MrzParser INFO - Check digit
> verification failed for document number: expected 0 but got h
> [error] Error parsing MRZ string: Failed to parse MRZ MRTD_TD1
> TOAUTIIISKhcceddce
> [error] FIOPOSAFIFESSISAUTReececeececs
> [error] MUSTERFRAUCCKISOLDECKdcddd
> [error]  at 15-16,0: Invalid character in MRZ record: c
>
>
> Can anyone please help me read the text properly? I also tried a regex
> to convert c or k back to <<<, but that did not work either. If anyone
> can suggest a workaround or an improvement to the code, I would
> appreciate it. Thanks.
>
>
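
The `mrz-java` errors quoted above are about row lengths and invalid characters, so a cheap sanity check on the OCR output before parsing can show quickly whether a preprocessing change helped. A minimal sketch in Python, assuming the TD1 layout from ICAO Doc 9303 (three rows of 30 characters drawn from A-Z, 0-9 and `<`). The function names are made up for illustration and are not part of Tesseract or mrz-java:

```python
import string

MRZ_ALPHABET = set(string.ascii_uppercase + string.digits + "<")
TD1_ROWS, TD1_COLS = 3, 30  # TD1 (ID card) layout per ICAO Doc 9303


def td1_problems(ocr_text):
    """Return a list of human-readable problems found in a TD1 MRZ candidate."""
    # Drop spaces inside rows and ignore empty lines, as the Scala code does.
    rows = [line.replace(" ", "") for line in ocr_text.splitlines() if line.strip()]
    problems = []
    if len(rows) != TD1_ROWS:
        problems.append(f"expected {TD1_ROWS} rows, got {len(rows)}")
    for i, row in enumerate(rows):
        if len(row) != TD1_COLS:
            problems.append(f"row {i}: length {len(row)}, expected {TD1_COLS}")
        bad = sorted(set(row) - MRZ_ALPHABET)
        if bad:
            problems.append(f"row {i}: invalid characters {bad}")
    return problems


def check_digit(field):
    """ICAO 9303 check digit: weights 7, 3, 1 repeating, sum modulo 10.

    Assumes the field has already passed the alphabet check above.
    """
    total = 0
    for pos, ch in enumerate(field):
        if ch.isdigit():
            value = int(ch)
        elif ch == "<":
            value = 0
        else:
            value = ord(ch) - ord("A") + 10  # A=10 ... Z=35
        total += value * (7, 3, 1)[pos % 3]
    return total % 10
```

Running `td1_problems` on the Attempt 3 output flags both the short rows and the lowercase characters, and `check_digit("L898902C3")` returns 6. An empty problem list only means the shape is plausible, not that the text is correct.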

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAJbzG8xbT8jWSOveXeSRCHE_Vr%2Bx%3DoXo0k4yuqtL_MUH%2BN6rRA%40mail.gmail.com.


[tesseract-ocr] Tesseract arabic numbers

2024-01-04 Thread Ahmed Khalid
I want to use Tesseract to read Arabic numbers, but the accuracy is low and 
it gives me incorrect readings.

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/58eed622-2cac-4405-9401-0868bc955043n%40googlegroups.com.


Re: [tesseract-ocr] Tesseract Fonts Recognition

2023-12-01 Thread Ger Hobbelt
Are you sure you're asking this in the correct group/forum?

Tesseract is an OCR tool, and "font mapping" has nothing to do with OCR. It
comes up, for example, when reprocessing PDF files (extracting content and
laying it out again, that sort of thing).
And then you generally only have to map Times New Roman and the dreaded Arial
when you're mixing MS Windows and Apple platforms. Otherwise it is just a
matter of extracting font identifiers and making sure you've got the correct
TrueType/OpenType fonts installed. Many require purchasing a license, if
you haven't already.


On Fri, 1 Dec 2023, 13:36 Timo Sternat,  wrote:

> Hi community,
>
> We are interested in mapping Windows fonts to Linux fonts. Do any of you
> have experience with font mappings from Windows to Linux that have worked
> well for you?
>
> Thank you for your recommendations.
>
> Best regards, Timo
>
>

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAFP60fpye_0z3i3Cmkmbp%3Db-GQf%2BcauHF1i0z9qS2sLn0ASW2g%40mail.gmail.com.


[tesseract-ocr] Tesseract Fonts Recognition

2023-12-01 Thread Timo Sternat


Hi community,

We are interested in mapping Windows fonts to Linux fonts. Do any of you 
have experience with font mappings from Windows to Linux that have worked 
well for you?

Thank you for your recommendations.

Best regards, Timo

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/eac15510-36b3-48c6-9ced-51247260ad28n%40googlegroups.com.


Re: [tesseract-ocr] Tesseract on single digit detection

2023-11-27 Thread Zdenko Podobny
Crop images properly (without borders) and follow suggestions in docs:

>tesseract pic2_cropped_postprocessed.png - --psm 10
5
>tesseract pic4_cropped_postprocessed.png - --psm 10
7
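
The same call can be scripted. A minimal sketch in Python that shells out to the `tesseract` executable with `--psm 10` (treat the image as a single character). The digit whitelist via `tessedit_char_whitelist` is an extra assumption that is not used in the commands above, and the helper names are made up for illustration:

```python
import subprocess


def digit_command(image_path, whitelist="0123456789"):
    """Build the tesseract command line for a single-character image."""
    cmd = ["tesseract", image_path, "-", "--psm", "10"]
    if whitelist:
        # Optional: restrict recognition to the given characters.
        cmd += ["-c", f"tessedit_char_whitelist={whitelist}"]
    return cmd


def read_digit(image_path):
    """Run tesseract and return the recognized text without surrounding whitespace."""
    result = subprocess.run(
        digit_command(image_path), capture_output=True, text=True, check=True
    )
    return result.stdout.strip()
```

Cropping away the border still has to happen before this step; the whitelist only narrows the candidates, it does not repair a badly framed image.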

Zdenko


On Mon, 27 Nov 2023 at 9:42, Fernando Benayas de los Santos <
ferbenaya...@gmail.com> wrote:

> Hi guys/gals!
> Long story short: I'm trying to use tesseract to extract a single digit
> from a small image (containing a single digit) but I can't get over ~50%
> accuracy.
>
> I have attached some examples of images. It should be pretty easy to
> extract the digit.
>
> So far, the best approach consists in using --psm 13 and zoom a bit to get
> the frame out. Tried to change the image to black/white, but it didn't
> solve much :(
>
> Any ideas? (I'm playing around with obscure -c options but they don't seem
> to have any effect)
>
>

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAJbzG8yP%2Be-ZE1R5uNJH%3DxiNnzfKOBAJ_tVuSX4ZduAp-_dK1g%40mail.gmail.com.


[tesseract-ocr] tesseract-ocr is not converting or extracting the text properly

2023-11-14 Thread Arul Britto Kumar Abraham
Hi,

I am using tesseract-ocr in my Python code to convert a non-searchable PDF to 
a searchable PDF document, but it is not converting the text fully.

I am using "poppler-23.08.0" to convert the PDF pages to images. From each 
image I am using the "pytesseract.image_to_pdf_or_hocr" method to produce a 
PDF file, and later I combine all the pages into a single file using 
PdfFileMerger.

Can anyone share your thoughts here?
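
For reference, the pipeline described above can be sketched as follows. This is an illustration under assumptions, not the poster's actual code: it uses `pdf2image` as the poppler wrapper and `pypdf`'s `PdfWriter.append` for merging instead of PdfFileMerger, and the 300 dpi default is a guess at a reasonable rasterization resolution.

```python
import io


def ocr_pages(images, ocr=None):
    """OCR each page image into a one-page searchable PDF (as bytes)."""
    if ocr is None:
        import pytesseract  # imported lazily so the helper can be tried with a stub
        ocr = pytesseract.image_to_pdf_or_hocr
    return [ocr(image, extension="pdf") for image in images]


def merge_pdfs(pdf_pages, out_path):
    """Concatenate one-page PDFs (bytes) into a single file using pypdf."""
    from pypdf import PdfWriter
    writer = PdfWriter()
    for page_bytes in pdf_pages:
        writer.append(io.BytesIO(page_bytes))
    with open(out_path, "wb") as fh:
        writer.write(fh)


def make_searchable(pdf_path, out_path, dpi=300):
    """Rasterize with poppler (via pdf2image), OCR each page, merge the results."""
    from pdf2image import convert_from_path
    images = convert_from_path(pdf_path, dpi=dpi)
    merge_pdfs(ocr_pages(images), out_path)
```

If text is missing from the result, the rasterization resolution is one of the first things worth checking, since low-resolution page images tend to hurt recognition.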

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/9ce4e11d-5d20-4cb8-bc6b-745913120799n%40googlegroups.com.


Re: [tesseract-ocr] Tesseract training for New font/language

2023-10-02 Thread Fish Money
Please share a sample of the image you're trying to recognize.

On Saturday, April 1, 2023 at 10:56:58 UTC-4, ali8a...@gmail.com wrote:

> Is it best to train a new language? 
>
> On Saturday, April 1, 2023 at 7:54:30 a.m. UTC-7 shree wrote:
>
>> Aurebesh seems to be different symbols mapped to the English alphabet 
>> rather than a new font for English, hence training would need to be for a 
>> new language rather than just fine-tuning.
>>
>> On Sat, Apr 1, 2023, 10:47 Ali Abedian  wrote:
>>
>>> Hello,
>>>
>>> Thank you for providing the references, but I'm still a bit confused. I 
>>> have trained tesseract using the same method as described in 
>>> https://github.com/tesseract-ocr/tesstrain/blob/main/ocrd-testset.zip, 
>>> with 100,000 sentences and a maximum iteration of 10,000. However, it still 
>>> cannot recognize a 6-letter word that I input from a TIF file using the 
>>> same font and settings. I have tried using fewer iterations, such as 1,000, 
>>> as well as more iterations, such as 20,000 and 100,000, but still no 
>>> results. Additionally, the BCER (Character Error Rate) doesn't seem to 
>>> change significantly with more iterations, remaining at 3.56%. I'm 
>>> unsure of what I'm doing wrong or what I should do next, but any help would 
>>> be appreciated.
>>>
>>> Thank you.
>>> On Saturday, April 1, 2023 at 12:05:36 a.m. UTC-7 zdenop wrote:
>>>
>>>> Please have a look at https://github.com/tesseract-ocr/tesstrain
>>>> (especially
>>>> https://github.com/tesseract-ocr/tesstrain/blob/main/ocrd-testset.zip)
>>>>
>>>> Zdenko
>>>>
>>>> On Fri, 31 Mar 2023 at 7:03, Ali Abedian wrote:
>>>>
>>>>> Hey everyone! I'm currently working on a personal project where I'm
>>>>> training a new font for the English language using Tesseract. The
>>>>> font is called Aurebesh and it's from the Star Wars universe.
>>>>> Basically, each letter in Aurebesh corresponds to a letter in
>>>>> English. I've collected close to 100,000 images and their
>>>>> corresponding translations, but I'm not sure how many iterations I
>>>>> should run for a file of this size. I've tried training with only
>>>>> 100 images, but it didn't work out. Can anyone advise me on how many
>>>>> iterations I should run and whether it's even possible to train a
>>>>> new font like this?
>>>>>
>>>
>>

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/56e4beb3-644b-4be6-8c21-84e9856ec013n%40googlegroups.com.


[tesseract-ocr] Tesseract - ORC unable to read IRCTC image

2023-09-19 Thread Sayan Bhandari
Hi people,
I found an unusual image on IRCTC during tatkal booking hours, i.e. 10 AM IST 
and 11 AM IST. During this time the website generates a distinctive type of 
captcha that has words in two different colours, divided horizontally, which 
Tesseract is not able to read.

If anyone has an answer to this, do let me know; otherwise we will need to 
train Tesseract for this.

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/81f158a4-a1b7-4deb-b685-debf84ad49d2n%40googlegroups.com.


Re: [tesseract-ocr] Tesseract Custom Model Not Recognized after Training

2023-09-18 Thread Zdenko Podobny
Unfortunately you left out all the important information (e.g. how did you
run training? how did you run tesseract, including tesseract options and the
exact command or code?), so here are just some hints:

> Error: LSTM requested, but not present!!

This implies that the requested traineddata file does not contain needed
LSTM components.

> Loading tesseract. Error: Tesseract (legacy) engine requested, but
> components are not present in
> /usr/share/tesseract-ocr/4.00/tessdata/ocrtensor.traineddata!!

This implies that the requested traineddata file does not contain needed
legacy components.

I have never seen these two messages together. Typically people either follow
some outdated tutorial and train the legacy Tesseract components, or train for
the LSTM engine (without legacy components) but then ask Tesseract to use the
legacy engine...
Based on this, I guess your ocrtensor.traineddata is not a valid Tesseract
traineddata file.
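
One way to check that guess is to list what the file actually contains with `combine_tessdata -d`. A rough sketch in Python, assuming the listing prints one `index:name:size=...` line per component and that the presence of `inttemp` is a fair marker for legacy classifier data. Both are assumptions about the tool's output, and the helper names are made up:

```python
import re
import subprocess

COMPONENT_RE = re.compile(r"^\d+:([\w-]+):size=", re.MULTILINE)


def parse_components(listing):
    """Extract component names from `combine_tessdata -d` style output."""
    return set(COMPONENT_RE.findall(listing))


def describe(listing):
    """Say which engines the listed components could plausibly serve."""
    names = parse_components(listing)
    return {
        "lstm": "lstm" in names,
        # Assumption: 'inttemp' is taken as the marker of legacy classifier data.
        "legacy": "inttemp" in names,
    }


def list_traineddata(path):
    """Run combine_tessdata on a file and return everything it printed."""
    result = subprocess.run(
        ["combine_tessdata", "-d", path], capture_output=True, text=True
    )
    # The listing may arrive on stderr, so both streams are returned.
    return result.stdout + result.stderr
```

If `describe(list_traineddata(path))` reports neither engine, the file is probably truncated, corrupted, or not a traineddata file at all, which would match both error messages appearing together.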

Zdenko


On Sun, 17 Sep 2023 at 17:41, demian kim wrote:

> Body:
>
> Hello Tesseract Community,
>
> I am facing a challenge with my custom-trained Tesseract model, and I'm
> hoping for some guidance on resolving this issue.
>
> Background:
>
>1. I've successfully trained a custom model (ocrtensor.traineddata).
>2. The training finished without any error and I've copied the
>generated .traineddata file to /usr/share/tesseract-ocr/4.00/tessdata/.
>3. I'm trying to use this model in a Jupyter Notebook container with
>the pytesseract Python package.
>
> Problem:
>
> Even though the model was working fine previously, I am now encountering
> an error when trying to use the model. The error suggests that Tesseract
> can't initialize with the custom model:
> TesseractError: (1, "Error: LSTM requested, but not present!! Loading
> tesseract. Error: Tesseract (legacy) engine requested, but components are
> not present in
> /usr/share/tesseract-ocr/4.00/tessdata/ocrtensor.traineddata!! Failed
> loading language 'ocrtensor' Tesseract couldn't load any languages! Could
> not initialize tesseract.")
>
> Steps Tried:
>
>1. Ensured the Tesseract version compatibility (using version 4).
>2. Checked file permissions (even tried with chmod 777).
>3. Restarted Jupyter Notebook container multiple times.
>4. Tried executing Tesseract from the terminal directly.
>5. Made sure the TESSDATA_PREFIX environment variable is set correctly.
>6. Tried Tesseract with logging enabled for additional error details.
>
> I'm unsure why the model suddenly isn't recognized when it was working
> just a while ago. If anyone has insights or suggestions on what might be
> going wrong, I would greatly appreciate it.
>
> Thank you for your assistance.
>
>

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAJbzG8wUNPLbMy4jXGDgYER3bEAsUfKLUwfb8hnSJ-CMLSvtdw%40mail.gmail.com.


[tesseract-ocr] Tesseract Custom Model Not Recognized after Training

2023-09-17 Thread demian kim


Body:

Hello Tesseract Community,

I am facing a challenge with my custom-trained Tesseract model, and I'm 
hoping for some guidance on resolving this issue.

Background:

   1. I've successfully trained a custom model (ocrtensor.traineddata).
   2. The training finished without any error and I've copied the generated 
   .traineddata file to /usr/share/tesseract-ocr/4.00/tessdata/.
   3. I'm trying to use this model in a Jupyter Notebook container with the 
   pytesseract Python package.

Problem:

Even though the model was working fine previously, I am now encountering an 
error when trying to use the model. The error suggests that Tesseract can't 
initialize with the custom model:
TesseractError: (1, "Error: LSTM requested, but not present!! Loading 
tesseract. Error: Tesseract (legacy) engine requested, but components are 
not present in 
/usr/share/tesseract-ocr/4.00/tessdata/ocrtensor.traineddata!! Failed 
loading language 'ocrtensor' Tesseract couldn't load any languages! Could 
not initialize tesseract.") 

Steps Tried:

   1. Ensured the Tesseract version compatibility (using version 4).
   2. Checked file permissions (even tried with chmod 777).
   3. Restarted Jupyter Notebook container multiple times.
   4. Tried executing Tesseract from the terminal directly.
   5. Made sure the TESSDATA_PREFIX environment variable is set correctly.
   6. Tried Tesseract with logging enabled for additional error details.

I'm unsure why the model suddenly isn't recognized when it was working just 
a while ago. If anyone has insights or suggestions on what might be going 
wrong, I would greatly appreciate it.

Thank you for your assistance.

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/eac448cf-79f3-4b41-9400-397710fb43c7n%40googlegroups.com.


[tesseract-ocr] Tesseract performance On ID cards and passports

2023-09-01 Thread Alexey Pismenskiy
I'm looking into OCR for ID cards and driver's licenses, and I found that 
Tesseract performs relatively poorly on ID cards compared to other OCR 
solutions. For this original image: 
https://github.com/apismensky/ocr_id/blob/main/images/boxes_easy/AR.png the 
results are: 

tesseract: "4d DL 999 as = Ne allo) 2NICK © , q 12 RESTR oe } lick: 5 DD 
88 1234 SZ"
easyocr:  '''9 , ARKANSAS DRIVER'S LICENSE CLAss D 4d DLN 9 3 DOB 
03/05/1960 ] 2 SCKPLE 123 NORTH STREET CITY AR 12345 ISS 4b EXP 03/05/2018 
03/05/2026 15 SEX 16 HGT 18 EYES 5'-10" BRO 9a END NONE 12 RESTR NONE Ylck 
Sorble DD 88 1234 THE'''
google cloud vision: """SARKANSAS\nSAMPLE\nSTATE O\n9 CLASS D\n4d DLN 
93 DOB 03/05/1960\nNick Sample\nDRIVER'S LICENSE\n1 SAMPLE\n2 
NICK\n8 123 NORTH STREET\nCITY, AR 12345\n4a ISS\n03/05/2018\n15 SEX 16 
HGT\nM\n5'-10\"\nGREAT SE\n9a END NONE\n12 RESTR NONE\n5 DD 88 
1234\n4b EXP\n03/05/2026 MS60\n18 EYES\nBRO\nRKANSAS\n0"""

and word accuracy is:

        | tesseract | easyocr | google
words   | 10.34%    | 68.97%  | 82.76%
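
The post does not say how word accuracy was computed. One plausible definition, shown here only as a sketch, is the share of reference words that also occur in the OCR output, counted with multiplicity. The reported figures are consistent with a 29-word reference (3/29 = 10.34%, 20/29 = 68.97%, 24/29 = 82.76%), though that is an inference from the numbers rather than something stated in the message.

```python
from collections import Counter


def word_accuracy(reference, ocr_output):
    """Fraction of reference words recovered in the OCR output (multiset match)."""
    ref = Counter(reference.split())
    hyp = Counter(ocr_output.split())
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Each reference word counts at most as often as it appears in the output.
    matched = sum(min(count, hyp[word]) for word, count in ref.items())
    return matched / total
```

This measure ignores word order and position, so it rewards an engine for finding the right words even when the layout analysis scrambles them.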

This is "out of the box" performance, without any preprocessing. I'm not 
surprised that Google Vision is that good compared to the others, but easyocr, 
which is another open source solution, performs much better than Tesseract 
in this case. I have a whole project dedicated to this, and in all the other 
results easyocr also does much better: 
https://github.com/apismensky/ocr_id/blob/main/result.json. All input files 
are in https://github.com/apismensky/ocr_id/tree/main/images/sources
After digging into it a little, I suspect that bounding box 
detection is much better in Google 
(https://github.com/apismensky/ocr_id/blob/main/images/boxes_google/AR.png) 
and easyocr 
(https://github.com/apismensky/ocr_id/blob/main/images/boxes_easy/AR.png) 
than in Tesseract 
(https://github.com/apismensky/ocr_id/blob/main/images/boxes_tesseract/AR.png). 

I'm pretty sure about this, because when I manually cut out the text boxes 
and feed them to Tesseract it works much better. 


Now questions: 

- Which part of the Tesseract codebase is responsible for text detection, 
and which algorithm does it use? 
- What affects bounding box detection in Tesseract so that it fails on 
these types of images (complex layouts, background noise, etc.)?
- Is it possible to use the same text detection procedure as easyocr, or to 
improve the existing one?  
- Would it be possible to switch the text detection algorithm based on the 
image type, or to make it pluggable so that the user can choose from several 
options A, B, C...?


Thanks. 

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/4f213209-cdee-4d73-a838-1aac4bb0b9afn%40googlegroups.com.


Re: [tesseract-ocr] Tesseract Failure in Jenkins execution on Windows VM

2023-08-30 Thread Ger Hobbelt
Hi,

[error] see:
https://github.com/RaiMan/SikuliX1/wiki/Windows:-Problems-with-libraries-OpenCV-or-Tesseract
[error]
Save your work, correct the problem and restart the IDE! __03_09_16PM ::
com.disney.sikuli.common.ScreenshotListener#onTestFailure > Failure with
method:testlogin __03_09_16PM :: com.disney.sikuli.common.
ScreenshotListener#onTestFailure > Failure with error: OCR: start:
Tesseract library problems: The specified module could not be found.

This reads as a VM specific problem: the error message indicates the
library cannot be found hence suspect Number One is the .dll/.so dynamic
library search path configuration or a similar OS level configuration
parameter that needs to be edited to make this error go away.

I don't know sikuli, but you might want to ask around over there to see if
someone had this same error message for tesseract or any other
package/library required by sikuli and then you might get some useful hints
as where to look in your VM config for the path(s) that need to be edited.
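
A quick way to compare the two environments is to print, from inside the Jenkins job, which directories on the search path contain the library file in question. A small diagnostic sketch in Python; the helper and the file name passed to it are hypothetical, and which file SikuliX actually loads is something to confirm on its side:

```python
import os


def find_on_path(filename, search_path=None):
    """Return every directory on the search path that contains `filename`."""
    if search_path is None:
        search_path = os.environ.get("PATH", "")
    hits = []
    for directory in search_path.split(os.pathsep):
        if directory and os.path.isfile(os.path.join(directory, filename)):
            hits.append(directory)
    return hits
```

Running the same check on the local machine and on the VM shows whether the two see the same directories. Note that on Windows "The specified module could not be found" can also mean that a dependency of the library is missing, so an empty result is not the only possible cause.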

Good luck,

Ger


On Wed, 30 Aug 2023, 16:17 DeviPrasad Patnala, 
wrote:

> Hello Everyone,
>
> Can anyone help to resolve my issue.
> I am working on a Sikuli Automation Framework for a desktop Application. I
> have used small line of code as below to extract text from the screen. It
> is working without any issues in my local machine but whereas when I tried
> running it on Windows VM via Jenkins I am getting error as below.
> Error:
> [error] see:
> https://github.com/RaiMan/SikuliX1/wiki/Windows:-Problems-with-libraries-OpenCV-or-Tesseract
> [error] Save your work, correct the problem and restart the IDE!
> __03_09_16PM :: com.disney.sikuli.common.ScreenshotListener#onTestFailure >
> Failure with method:testlogin __03_09_16PM ::
> com.disney.sikuli.common.ScreenshotListener#onTestFailure > Failure with
> error: OCR: start: Tesseract library problems: The specified module could
> not be found.
>
> *Code:*
> Region region = new Region(807, 59, 299, 52);
>
> screen.setRect(region);
>
> String text = screen.text();
>
> System.out.println("Extracted Text = " + text);
>
>

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAFP60fo4J5C7GyzSfp%2B2EMrnCt-DakOR%3DtWqBW-6Nrv0v0LhDQ%40mail.gmail.com.


[tesseract-ocr] Tesseract Failure in Jenkins execution on Windows VM

2023-08-30 Thread DeviPrasad Patnala
Hello Everyone,

Can anyone help me resolve my issue?
I am working on a Sikuli automation framework for a desktop application. I 
use the few lines of code below to extract text from the screen. They work 
without any issues on my local machine, but when I run them on a Windows VM 
via Jenkins I get the error below.
Error:
[error] see: 
https://github.com/RaiMan/SikuliX1/wiki/Windows:-Problems-with-libraries-OpenCV-or-Tesseract
 
[error] Save your work, correct the problem and restart the IDE! 
__03_09_16PM :: com.disney.sikuli.common.ScreenshotListener#onTestFailure > 
Failure with method:testlogin __03_09_16PM :: 
com.disney.sikuli.common.ScreenshotListener#onTestFailure > Failure with 
error: OCR: start: Tesseract library problems: The specified module could 
not be found.

*Code:*
Region region = new Region(807, 59, 299, 52);

screen.setRect(region);

String text = screen.text();

System.out.println("Extracted Text = " + text);

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/870b1efa-aaee-4041-81e6-f0845e5a56e0n%40googlegroups.com.


Re: [tesseract-ocr] Tesseract-ocr in quiet mode

2023-07-23 Thread astro

Hi,
Just found the solution. Here is a code snippet in case anyone is 
interested.

' "command" and args are placeholders for the executable and its arguments
Dim p As New ProcessStartInfo("command", args)
p.WindowStyle = ProcessWindowStyle.Hidden
p.CreateNoWindow = True
Process.Start(p)

Cheers
Nor




On 7/23/2023 10:13 AM, astro wrote:

Hi Zdenko
THanks for that reply. I wasn't sure if that was the case or not. 
Guess I just have to live with it.


Cheers
 NOr

On 7/23/2023 10:01 AM, Zdenko Podobny wrote:
It is not a tesseract problem but the VB. Prove for this you can find 
in pytesseract that call tesseract executable without console windows.


Zdenko


On Sun, 23 Jul 2023 at 15:55, nor s wrote:

Is there a way to have Tesseract run without producing a Dos
window? I'm incorporating a call to Tesseract-ocr in my VB.net
application to read some date info from an image. Each time I 
execute Tesseract I get a dos window popping up.  I'm on windows
10 and Tesseract 5.0
Thanks
 Nor





To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/bf8b20da-9cdb-caa2-deec-4e5ba81dee55%40gmail.com.


Re: [tesseract-ocr] Tesseract-ocr in quiet mode

2023-07-23 Thread astro

Hi Zdenko
Thanks for that reply. I wasn't sure if that was the case or not. Guess 
I just have to live with it.


Cheers
 NOr

On 7/23/2023 10:01 AM, Zdenko Podobny wrote:
It is not a tesseract problem but the VB. Prove for this you can find 
in pytesseract that call tesseract executable without console windows.


Zdenko


On Sun, 23 Jul 2023 at 15:55, nor s wrote:

Is there a way to have Tesseract run without producing a Dos
window? I'm incorporating a call to Tesseract-ocr in my VB.net
application to read some date info from an image. Each time I 
execute Tesseract I get a dos window popping up.  I'm on windows
10 and Tesseract 5.0
Thanks
 Nor



Re: [tesseract-ocr] Tesseract-ocr in quiet mode

2023-07-23 Thread Zdenko Podobny
It is not a Tesseract problem but a VB one. You can find proof of this in
pytesseract, which calls the tesseract executable without opening a console
window.
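
For illustration, here is a rough Python sketch of the kind of call that keeps the console window hidden. It is a sketch under assumptions, not pytesseract's actual code: the function names are made up, and tesseract is assumed to be on the PATH. On Windows the child process is started with a hidden window through `subprocess.STARTUPINFO`; other platforms need nothing special. In a VB.NET program the corresponding settings would presumably be `ProcessStartInfo.CreateNoWindow = True` together with `UseShellExecute = False`.

```python
import subprocess
import sys


def hidden_window_kwargs():
    """Extra subprocess arguments that suppress the console window on Windows."""
    if sys.platform != "win32":
        return {}  # Only Windows needs special handling here.
    startupinfo = subprocess.STARTUPINFO()
    startupinfo.dwFlags |= subprocess.STARTF_USESHOWWINDOW
    startupinfo.wShowWindow = subprocess.SW_HIDE  # start the child hidden
    return {"startupinfo": startupinfo}


def run_tesseract(image_path, exe="tesseract"):
    """Run OCR on image_path and return the text that tesseract prints."""
    result = subprocess.run(
        [exe, image_path, "stdout"],  # "stdout" as output base prints the text
        capture_output=True,
        check=True,
        **hidden_window_kwargs(),
    )
    return result.stdout.decode("utf-8")
```

Calling `run_tesseract("date.png")` (a hypothetical file name) would return the recognized text as a string without a window flashing up.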

Zdenko


On Sun, 23 Jul 2023 at 15:55, nor s wrote:

> Is there a way to have Tesseract run without producing a DOS window? I'm
> incorporating a call to Tesseract-ocr in my VB.net application to read some
> date info from an image. Each time I execute Tesseract, a DOS window
> pops up. I'm on Windows 10 with Tesseract 5.0.
> Thanks
>  Nor


[tesseract-ocr] Tesseract-ocr in quiet mode

2023-07-23 Thread nor s
Is there a way to have Tesseract run without producing a DOS window? I'm
incorporating a call to Tesseract-ocr in my VB.net application to read some
date info from an image. Each time I execute Tesseract, a DOS window
pops up. I'm on Windows 10 with Tesseract 5.0.
Thanks
 Nor



Re: [tesseract-ocr] tesseract runs but gives no output

2023-07-14 Thread Zdenko Podobny
tesseract d:\temp\temp\Screenshot_20230601_102638.jpg -l eng+hin 1>>c:\temp\temp2.txt

is not the correct command. Did you mean:

tesseract d:\temp\temp\Screenshot_20230601_102638.jpg output -l eng+hin 1>>c:\temp\temp2.txt

Please consult
 tesseract --help
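
A note on where the text ends up, with a small sketch. tesseract takes the image first, then an output base, then options. With `output` as the output base the text is written to output.txt, so a redirected temp2.txt would still stay empty; using `stdout` as the output base sends the text to standard output instead. The `read_params_file: Can't open eng+hin` line in the log suggests that `eng+hin` was being read as a config file rather than as the language. Below is a minimal Python version of the batch loop under those assumptions; the helper names `build_command` and `ocr_to_file` are made up for illustration.

```python
import subprocess


def build_command(image, langs, exe="tesseract"):
    """Argument order: image, output base, then options."""
    return [exe, image, "stdout", "-l", "+".join(langs)]


def ocr_to_file(images, langs, target):
    """Append the recognized text of every image to one file."""
    with open(target, "a", encoding="utf-8") as out:
        for image in images:
            result = subprocess.run(
                build_command(image, langs), capture_output=True, check=True
            )
            out.write(result.stdout.decode("utf-8"))


print(build_command(r"d:\temp\temp\Screenshot_20230601_102638.jpg", ["eng", "hin"]))
```

In a plain cmd loop the equivalent change would be to put `stdout` (or `-`) between the image and `-l eng+hin`.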

Zdenko


On Fri, 14 Jul 2023 at 14:19, Ales Ropia wrote:

> C:\Program Files (x86)\Tesseract-OCR>for %i in (d:\temp\temp\*.*) do
> (tesseract %i >> c:\temp\temp2.txt -l eng+hin)
>
> C:\Program Files (x86)\Tesseract-OCR>(tesseract
> d:\temp\temp\Screenshot_20230601_102548.jpg  -l eng+hin
> 1>>c:\temp\temp2.txt )
> read_params_file: Can't open eng+hin
> Tesseract Open Source OCR Engine v4.0.0.20181030 with Leptonica
> Warning: Invalid resolution 0 dpi. Using 70 instead.
> Estimating resolution as 434
>
> C:\Program Files (x86)\Tesseract-OCR>(tesseract
> d:\temp\temp\Screenshot_20230601_102638.jpg  -l eng+hin
> 1>>c:\temp\temp2.txt )
> read_params_file: Can't open eng+hin
> Tesseract Open Source OCR Engine v4.0.0.20181030 with Leptonica
> Warning: Invalid resolution 0 dpi. Using 70 instead.
> Estimating resolution as 492
>
> C:\Program Files (x86)\Tesseract-OCR>(tesseract
> d:\temp\temp\Screenshot_20230601_102708.jpg  -l eng+hin
> 1>>c:\temp\temp2.txt )
> read_params_file: Can't open eng+hin
> Tesseract Open Source OCR Engine v4.0.0.20181030 with Leptonica
> Warning: Invalid resolution 0 dpi. Using 70 instead.
> Estimating resolution as 324
> Detected 79 diacritics
>
> the c:\temp\temp2.txt is empty.
>
> Kindly help


[tesseract-ocr] tesseract runs but gives no output

2023-07-14 Thread Ales Ropia
C:\Program Files (x86)\Tesseract-OCR>for %i in (d:\temp\temp\*.*) do 
(tesseract %i >> c:\temp\temp2.txt -l eng+hin)

C:\Program Files (x86)\Tesseract-OCR>(tesseract 
d:\temp\temp\Screenshot_20230601_102548.jpg  -l eng+hin 
1>>c:\temp\temp2.txt )
read_params_file: Can't open eng+hin
Tesseract Open Source OCR Engine v4.0.0.20181030 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 434

C:\Program Files (x86)\Tesseract-OCR>(tesseract 
d:\temp\temp\Screenshot_20230601_102638.jpg  -l eng+hin 
1>>c:\temp\temp2.txt )
read_params_file: Can't open eng+hin
Tesseract Open Source OCR Engine v4.0.0.20181030 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 492

C:\Program Files (x86)\Tesseract-OCR>(tesseract 
d:\temp\temp\Screenshot_20230601_102708.jpg  -l eng+hin 
1>>c:\temp\temp2.txt )
read_params_file: Can't open eng+hin
Tesseract Open Source OCR Engine v4.0.0.20181030 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 324
Detected 79 diacritics

The file c:\temp\temp2.txt is empty.

Kindly help.



Re: [tesseract-ocr] Tesseract completely fails to recognize consolas font from high resolution image

2023-05-01 Thread Zdenko Podobny
   1. Try the tesseract executable if you run into problems when using
   the API or tesseract wrappers.
   2. Did you try image processing (as suggested by the tesseract documentation)?
   3. Did you try custom image segmentation? Your image looks like a table,
   and tesseract's layout analysis has problems with tables.


Zdenko


On Fri, 28 Apr 2023 at 19:18, Are wrote:

> Hello,
>
> I have this simple Tesseract code which takes the attached image and
> prints the result to the console.
> I cropped the image to only include the necessary information (the full
> document has sensitive information). Either way, using the cropped image or
> the full one, it successfully reads most of the text, except for the text
> in the Consolas font.
>
> The output I get from the attached image is: ">BUWVveAmæUw >» >> U U"
> Although, when I use the full image, it is able to read the bot
>
> I'm using the nor.traineddata, but the result is very similar with
> eng.traineddata also.
>
>
>
> Here's my code:
>
> using System;
> using Tesseract;
>
> namespace ConsoleApp1
> {
>     class Program
>     {
>         static void Main(string[] args)
>         {
>             using (var engine = new TesseractEngine(@"./tessdata", "nor", EngineMode.Default))
>             {
>                 using (var img = Pix.LoadFromFile(@"./images/unnamed2.jpg"))
>                 {
>                     using (var page = engine.Process(img))
>                     {
>                         var text = page.GetText();
>                         Console.WriteLine(text);
>                     }
>                 }
>             }
>         }
>     }
> }
>
>
>
> *Here's the image:*
>
> [image: unnamed2.jpg]


[tesseract-ocr] Tesseract completely fails to recognize consolas font from high resolution image

2023-04-28 Thread Are
Hello,

I have this simple Tesseract code which takes the attached image and prints
the result to the console.
I cropped the image to only include the necessary information (the full
document has sensitive information). Either way, using the cropped image or
the full one, it successfully reads most of the text, except for the text
in the Consolas font.

The output I get from the attached image is: ">BUWVveAmæUw >» >> U U"
Although, when I use the full image, it is able to read the bot

I'm using the nor.traineddata, but the result is very similar with
eng.traineddata also.



Here's my code:

using System;
using Tesseract;

namespace ConsoleApp1
{
    class Program
    {
        static void Main(string[] args)
        {
            // Load the Norwegian model from ./tessdata with the default engine mode.
            using (var engine = new TesseractEngine(@"./tessdata", "nor", EngineMode.Default))
            {
                // Read the cropped image from disk.
                using (var img = Pix.LoadFromFile(@"./images/unnamed2.jpg"))
                {
                    // Run OCR on the whole image and print the recognized text.
                    using (var page = engine.Process(img))
                    {
                        var text = page.GetText();
                        Console.WriteLine(text);
                    }
                }
            }
        }
    }
}



*Here's the image:*

[image: unnamed2.jpg]



Re: [tesseract-ocr] Tesseract training for New font/language

2023-04-01 Thread Ali Abedian
Is it best to train a new language? 

On Saturday, April 1, 2023 at 7:54:30 a.m. UTC-7 shree wrote:

> Aurebesh seems to be different symbols mapped to the English alphabet 
> rather than a new font for English, hence training would need to be for a 
> new language rather than just fine-tuning.
>
> On Sat, Apr 1, 2023, 10:47 Ali Abedian  wrote:
>
>> Hello,
>>
>> Thank you for providing the references, but I'm still a bit confused. I
>> have trained tesseract using the same method as described in
>> https://github.com/tesseract-ocr/tesstrain/blob/main/ocrd-testset.zip,
>> with 100,000 sentences and a maximum of 10,000 iterations. However, it still
>> cannot recognize a 6-letter word that I input from a TIF file using the
>> same font and settings. I have tried fewer iterations, such as 1,000,
>> as well as more, such as 20,000 and 100,000, but still get no
>> results. Additionally, the BCER (Character Error Rate) doesn't seem to
>> change significantly with more iterations, remaining at 3.56%. I'm
>> unsure of what I'm doing wrong or what I should do next, but any help would
>> be appreciated.
>>
>> Thank you.
>> On Saturday, April 1, 2023 at 12:05:36 a.m. UTC-7 zdenop wrote:
>>
>>> Please have a look  at https://github.com/tesseract-ocr/tesstrain 
>>> (especially 
>>> https://github.com/tesseract-ocr/tesstrain/blob/main/ocrd-testset.zip)
>>>
>>>
>>> Zdenko
>>>
>>>
>>> On Fri, 31 Mar 2023 at 7:03, Ali Abedian wrote:
>>>
>>>> Hey everyone! I'm currently working on a personal project where I'm
>>>> training a new font for the English language using Tesseract. The font is
>>>> called Aurebesh and it's from the Star Wars universe. Basically, each
>>>> letter in Aurebesh corresponds to a letter in English. I've collected close
>>>> to 100,000 images and their corresponding translations, but I'm not sure
>>>> how many iterations I should run for a file of this size. I've tried
>>>> training with only 100 images, but it didn't work out. Can anyone advise me
>>>> on how many iterations I should run and whether it's even possible to train
>>>> a new font like this?



Re: [tesseract-ocr] Tesseract training for New font/language

2023-04-01 Thread Shree Devi Kumar
Aurebesh seems to be different symbols mapped to the English alphabet
rather than a new font for English, hence training would need to be for a
new language rather than just fine-tuning.
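
To make the distinction concrete, here is a rough Python sketch of how the two kinds of run might be launched through tesstrain's Makefile. It assumes the usual tesstrain variables (MODEL_NAME, START_MODEL, MAX_ITERATIONS); the function name and the model name `aurebesh` are made up for illustration. Fine-tuning passes START_MODEL, while training a new language leaves it out so that training starts from scratch.

```python
def tesstrain_command(model_name, max_iterations, start_model=None):
    """Build a `make training` invocation for the tesstrain Makefile."""
    command = [
        "make",
        "training",
        f"MODEL_NAME={model_name}",
        f"MAX_ITERATIONS={max_iterations}",
    ]
    if start_model is not None:
        # Fine-tuning: continue from an existing model such as eng.
        command.append(f"START_MODEL={start_model}")
    return command


# New language: no start model, so the network is trained from scratch.
print(" ".join(tesstrain_command("aurebesh", 10000)))
# Fine-tuning an existing model, for comparison.
print(" ".join(tesstrain_command("aurebesh", 10000, start_model="eng")))
```

The resulting command would be run from a tesstrain checkout, with the ground truth placed where the Makefile expects it.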

On Sat, Apr 1, 2023, 10:47 Ali Abedian  wrote:

> Hello,
>
> Thank you for providing the references, but I'm still a bit confused. I
> have trained tesseract using the same method as described in
> https://github.com/tesseract-ocr/tesstrain/blob/main/ocrd-testset.zip,
> with 100,000 sentences and a maximum of 10,000 iterations. However, it still
> cannot recognize a 6-letter word that I input from a TIF file using the
> same font and settings. I have tried fewer iterations, such as 1,000,
> as well as more, such as 20,000 and 100,000, but still get no
> results. Additionally, the BCER (Character Error Rate) doesn't seem to
> change significantly with more iterations, remaining at 3.56%. I'm
> unsure of what I'm doing wrong or what I should do next, but any help would
> be appreciated.
>
> Thank you.
> On Saturday, April 1, 2023 at 12:05:36 a.m. UTC-7 zdenop wrote:
>
>> Please have a look  at https://github.com/tesseract-ocr/tesstrain
>> (especially
>> https://github.com/tesseract-ocr/tesstrain/blob/main/ocrd-testset.zip)
>>
>>
>> Zdenko
>>
>>
>> On Fri, 31 Mar 2023 at 7:03, Ali Abedian wrote:
>>
>>> Hey everyone! I'm currently working on a personal project where I'm
>>> training a new font for the English language using Tesseract. The font is
>>> called Aurebesh and it's from the Star Wars universe. Basically, each
>>> letter in Aurebesh corresponds to a letter in English. I've collected close
>>> to 100,000 images and their corresponding translations, but I'm not sure
>>> how many iterations I should run for a file of this size. I've tried
>>> training with only 100 images, but it didn't work out. Can anyone advise me
>>> on how many iterations I should run and whether it's even possible to train
>>> a new font like this?


Re: [tesseract-ocr] Tesseract training for New font/language

2023-04-01 Thread Ali Abedian


Hello,

Thank you for providing the references, but I'm still a bit confused. I
have trained tesseract using the same method as described in
https://github.com/tesseract-ocr/tesstrain/blob/main/ocrd-testset.zip, with
100,000 sentences and a maximum of 10,000 iterations. However, it still
cannot recognize a 6-letter word that I input from a TIF file using the
same font and settings. I have tried fewer iterations, such as 1,000,
as well as more, such as 20,000 and 100,000, but still get no
results. Additionally, the BCER (Character Error Rate) doesn't seem to
change significantly with more iterations, remaining at 3.56%. I'm
unsure of what I'm doing wrong or what I should do next, but any help would
be appreciated.

Thank you.
On Saturday, April 1, 2023 at 12:05:36 a.m. UTC-7 zdenop wrote:

> Please have a look  at https://github.com/tesseract-ocr/tesstrain 
> (especially 
> https://github.com/tesseract-ocr/tesstrain/blob/main/ocrd-testset.zip)
>
>
> Zdenko
>
>
> On Fri, 31 Mar 2023 at 7:03, Ali Abedian wrote:
>
>> Hey everyone! I'm currently working on a personal project where I'm 
>> training a new font for the English language using Tesseract. The font is 
>> called Aurebesh and it's from the Star Wars universe. Basically, each 
>> letter in Aurebesh corresponds to a letter in English. I've collected close 
>> to 100,000 images and their corresponding translations, but I'm not sure 
>> how many iterations I should run for a file of this size. I've tried 
>> training with only 100 images, but it didn't work out. Can anyone advise me 
>> on how many iterations I should run and whether it's even possible to train 
>> a new font like this?


Re: [tesseract-ocr] Tesseract accuracy.

2023-04-01 Thread Zdenko Podobny
As the first step, I would suggest you read
https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md

Next: the LSTM model is trained on words and lines of text, so it can have
problems with "codes" like these. For such images the legacy mode is perfect. E.g.:

tesseract WCAZ.png - --psm 6 --oem 0
W C A Z
tesseract DVEO.png - --psm 6 --oem 0
D V E O

The legacy engine model is available in the language files in the tessdata
repository (https://github.com/tesseract-ocr/tessdata). Many
installations prefer the fast models, which do not include the legacy model.
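
The same call can be scripted. Here is a minimal Python sketch, assuming tesseract is on the PATH and the installed traineddata includes the legacy model; the helper names are made up for illustration. `--psm 6` treats the image as a single uniform block of text and `--oem 0` selects the legacy engine only.

```python
import subprocess


def legacy_command(image, exe="tesseract"):
    """Command line for the legacy engine, printing the text to stdout ("-")."""
    return [exe, image, "-", "--psm", "6", "--oem", "0"]


def read_code(image):
    """Return the recognized letters with the spaces between them removed."""
    result = subprocess.run(legacy_command(image), capture_output=True, check=True)
    return result.stdout.decode("utf-8").replace(" ", "").strip()


print(" ".join(legacy_command("WCAZ.png")))
```

If the traineddata lacks the legacy model, tesseract is expected to fail at startup, and `check=True` turns that into an exception rather than an empty result.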

Zdenko


On Sat, 25 Mar 2023 at 8:39, Kyle Zeneki wrote:

> Hello, I have these images and I'm trying to print their output using
> Tesseract. I spent 2 hours fine-tuning Tesseract for a specific font, and
> the error rate was 0.163. I used multiple font-detecting websites, and the
> closest match was "Futura Now." However, Tesseract sometimes fails to read
> the "E" from "D V E O" but successfully reads the "E" from "EOPEO." It also
> occasionally misreads "S E G I E" as "Ss Ee G I E." etc. I'm wondering if
> there's a way to train Tesseract by image rather than by font.
> Alternatively, is there a better tool than Tesseract, such as EasyOCR?
> [image: capture9.png][image: capture4.png][image: capture5.png][image:
> capture6.png][image: capture7.png][image: capture8.png]


Re: [tesseract-ocr] Tesseract training for New font/language

2023-04-01 Thread Zdenko Podobny
Please have a look at https://github.com/tesseract-ocr/tesstrain
(especially
https://github.com/tesseract-ocr/tesstrain/blob/main/ocrd-testset.zip)


Zdenko


On Fri, 31 Mar 2023 at 7:03, Ali Abedian wrote:

> Hey everyone! I'm currently working on a personal project where I'm
> training a new font for the English language using Tesseract. The font is
> called Aurebesh and it's from the Star Wars universe. Basically, each
> letter in Aurebesh corresponds to a letter in English. I've collected close
> to 100,000 images and their corresponding translations, but I'm not sure
> how many iterations I should run for a file of this size. I've tried
> training with only 100 images, but it didn't work out. Can anyone advise me
> on how many iterations I should run and whether it's even possible to train
> a new font like this?


[tesseract-ocr] Tesseract training for New font/language

2023-03-30 Thread Ali Abedian
Hey everyone! I'm currently working on a personal project where I'm 
training a new font for the English language using Tesseract. The font is 
called Aurebesh and it's from the Star Wars universe. Basically, each 
letter in Aurebesh corresponds to a letter in English. I've collected close 
to 100,000 images and their corresponding translations, but I'm not sure 
how many iterations I should run for a file of this size. I've tried 
training with only 100 images, but it didn't work out. Can anyone advise me 
on how many iterations I should run and whether it's even possible to train 
a new font like this?



[tesseract-ocr] Tesseract accuracy.

2023-03-25 Thread Kyle Zeneki
Hello, I have these images and I'm trying to print their output using 
Tesseract. I spent 2 hours fine-tuning Tesseract for a specific font, and 
the error rate was 0.163. I used multiple font-detecting websites, and the 
closest match was "Futura Now." However, Tesseract sometimes fails to read 
the "E" from "D V E O" but successfully reads the "E" from "EOPEO." It also 
occasionally misreads "S E G I E" as "Ss Ee G I E." etc. I'm wondering if 
there's a way to train Tesseract by image rather than by font.
Alternatively, is there a better tool than Tesseract, such as EasyOCR?
[image: capture9.png][image: capture4.png][image: capture5.png][image: 
capture6.png][image: capture7.png][image: capture8.png]



Re: [tesseract-ocr] Tesseract doesn't recognize some numbers from an image

2023-03-24 Thread Rodhad
Hi, I'm trying to do the same thing you already tried, on the same kind of
images, unfortunately with no results. I've even tried training a neural
network to recognize the numbers, but that gave bad results too.
I know some time has passed since your last post, but I would really
like to know whether you found a solution.

Thanks in advance.
On Monday, 11 April 2022 at 01:37:29 UTC+2, javalover wrote:

>
> @zdenop: the code I posted in the OP doesn't do the rescaling or the
> border removal, but I've already tried both with no luck.
> I apologize if I called you ignorant, but doesn't your sentence "It
> does not matter how many times you post this (SO, issue tracker)..." feel a
> little bit aggressive towards me for no reason...? Even if I try to improve
> quality, it doesn't mean it has to work at all.
> :-)
> I acknowledge that you may have contributed to the Tesseract project, but
> could we please keep a non-arrogant tone? I'm not familiar with Tesseract.
> It's the first time I'm using it, so maybe I'm doing something wrong.
> Did you actually find the solution to my problem (does it work for you by
> rescaling/removing borders)? If yes, could you post the code you
> used? It didn't work for me.
>
> Thanks.
> On Saturday, 9 April 2022 at 09:40:21 UTC+2, zdenop wrote:
>
>> Well  
>> https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md
>>
>>
>> *Rescaling* -> Optimal resolution suggests that the size of input
>> letters should be 30-33 points. YOU IGNORED it - yours has 240 points!!!
>> "for 4.x version use dark text on light background" -> YOU IGNORED IT.
>> *Border* -> you did not remove them
>>
>> Conclusion: you are ignoring documentation. Good luck with asking for 
>> support while blaming others for ignorance.
>>
>> Zdenko
>>
>>
>> On Fri, 8 Apr 2022 at 0:56, javalover wrote:
>>
>>> @zdenop: it doesn't matter how arrogant/ignorant you can be, you
>>> would never have understood that the reason I'm posting here is just
>>> because of this.
>>> Thus, if you had actually read the page you linked me ("Still having
>>> problems?" paragraph), you would have come to the conclusion that all
>>> proposed solutions don't work:
>>>
>>> If you've tried the above and are still getting low accuracy results, ask
>>> on the forum for help, ideally posting an example image.
>>>
>>> That page suggested asking here about further problems. And as you've
>>> already said, somebody has already suggested that I work on these image
>>> processing procedures before feeding the image to Tesseract:
>>>
>>>    - Rescaling
>>>    - Binarisation
>>>    - Noise Removal
>>>    - Dilation / Erosion
>>>    - Rotation / Deskewing
>>>    - Borders
>>>    - Transparency / Alpha channel
>>>    - Tools / Libraries
>>>    - Examples
>>>    - Tables recognitions
>>>
>>> So, I think you're smart enough to understand why I'm still here. Thanks.
>>>
>>> On Thursday, 7 April 2022 at 15:14:45 UTC+2, zdenop wrote:
>>>
>>>> It does not matter how many times you post this (SO, issue tracker):
>>>> the answer will be the same. Read and follow the instructions in
>>>> https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md
>>>>
>>>> Zdenko
>>>>
>>>> On Thu, 7 Apr 2022 at 7:19, javalover wrote:
>>>>
>>>>> I'm deskewing an image containing a number using a projection-profile-based
>>>>> skew estimation algorithm and extracting the number through OCR.
>>>>>
>>>>> In order to calculate the correct skew angle, we compare the maximum
>>>>> difference between peaks and, using the resulting skew angle, rotate the
>>>>> image to correct the skew.
>>>>>
>>>>> Each image (which has a single number) has
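
The projection-profile idea quoted above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the page is given as a list of (x, y) foreground pixel coordinates, the candidate angles are whole degrees, and each candidate is scored by the sum of squared row counts, a simple stand-in for comparing peak differences and not necessarily what the original poster used. All function names are made up.

```python
import math


def estimate_skew(points, candidates):
    """Return the candidate angle (degrees) whose row profile is sharpest."""
    best_angle, best_score = None, -1
    for angle in candidates:
        a = math.radians(angle)
        cos_a, sin_a = math.cos(a), math.sin(a)
        histogram = {}
        for x, y in points:
            # Row the pixel would land in if the page were rotated back by `angle`.
            row = round(y * cos_a - x * sin_a)
            histogram[row] = histogram.get(row, 0) + 1
        # Text lines aligned with the rows give a few tall peaks, so a large score.
        score = sum(count * count for count in histogram.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle


def synthetic_lines(skew_degrees, width=400, baselines=(40, 80, 120)):
    """Pixels of three straight text baselines tilted by skew_degrees."""
    slope = math.tan(math.radians(skew_degrees))
    return [(x, round(y0 + x * slope)) for y0 in baselines for x in range(width)]


print(estimate_skew(synthetic_lines(3), range(-10, 11)))  # prints 3
```

Rotating the image by the negative of the estimated angle would then straighten the lines; the sign convention depends on the imaging library, so it is worth checking on a test image.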

Re: [tesseract-ocr] tesseract returns random and spurious characters

2023-03-24 Thread Zdenko Podobny
Hello,

unless you provide a test case that reproduces the problem (plus information
about your Tesseract version, language data, platform, etc.), nobody can help you.

Zdenko


On Tue, 21 Jun 2022 at 19:25, Z. Jay wrote:

> We have been using a competing OCR tool and are now evaluating a switch to
> tesseract. However, when converting a png, tesseract randomly - albeit
> rarely, returns characters where there is only white space. For example,
> tesseract will return a comma or equal sign where there is only white
> space. Scrutinizing the PNG, I do not see anything such as dirt or a speck
> which looks like anything other than white space. While this is rare and
> random, it happens enough to be a problem. Note that this does not occur
> when using our current OCR tool. I suspect someone has encountered this
> issue before and already posted the solution somewhere on this list or
> elsewhere.
>
> For reference, here is a comparison of the actual text and the text
> returned by tesseract:
> Actual:
>10/17  10/17,  PAYMENT THANK YOU $64.79CR
>
> Returned:
>10/17, 10/17,  =PAYMENT THANK YOU $64.79CR
>
> Any pointers appreciated.
>
> Thanks,
>
> --zj

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAJbzG8wxZCSuF5t3fMf-CF6sNqVzOKgSf0tmQ%2BmmecaR%3DY7wJw%40mail.gmail.com.


Re: [tesseract-ocr] TesserAct implementation help

2023-02-07 Thread Zdenko Podobny
Please read the Tesseract documentation at
https://github.com/tesseract-ocr/tessdoc - there is a simple, working
example of a Tesseract implementation.

Zdenko


On Tue, 7 Feb 2023 at 22:16, Massimo wrote:

> Hi,
>
> I would like to make an open source application using Tesseract with
> Xamarin.Forms.
>
> I'm new to app development, but I have an idea that I would like to
> implement using the phone's camera and OCR.
>
> The only problem is that I can't find a recent, simple and working example
> of a Tesseract implementation.
>
> Can you help me?
>
> Thanks in advance.

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAJbzG8zUFvt0ys1iPwd1nNYBLpP-w176sc74q%3DGnOFgKhJEUbg%40mail.gmail.com.


[tesseract-ocr] TesserAct implementation help

2023-02-07 Thread Massimo


Hi,

I would like to make an open source application using Tesseract with
Xamarin.Forms.

I'm new to app development, but I have an idea that I would like to
implement using the phone's camera and OCR.

The only problem is that I can't find a recent, simple and working example
of a Tesseract implementation.

Can you help me?

Thanks in advance.

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/4e48ce37-07b7-4829-872a-09af1d0313d5n%40googlegroups.com.


[tesseract-ocr] Tesseract misreading issues

2023-01-27 Thread vc Jayan
Hi all

I have been facing misreading issues with Tesseract OCR from the beginning,
and they persist after upgrading to the latest 5.x version.
I have already tried different permutations and combinations of image
pre-processing.

Sometimes it reads 5 as S, 6 and 8 as 0, I as 1, etc.

Is there a licensed version available?
-- 
R's
vcjayan

*"When all think alike, then no one is thinking.."*

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAGy9PEMuQCMPu_fYLDP0NLYZNek9JqWhO2kCAFT4GXtj3TkPxA%40mail.gmail.com.


Re: [tesseract-ocr] tesseract 5.2.0 available on raspi4?

2022-12-30 Thread 'Topas Topas' via tesseract-ocr
Hi Zdenko,

this worked perfectly. Thanks a lot!!!


zdenop wrote on Tuesday, 27 December 2022 at 20:53:39 UTC+1:

> Hello,
>
> Those commands are, IMO, for Ubuntu, and AFAIK Raspbian is based on
> Debian... So the error saying that Raspbian/buster is not supported is
> correct.
>
> Try these steps/commands for Raspberry:
>
> sudo apt update
> sudo apt install apt-transport-https
>
> sudo cp /etc/apt/sources.list /etc/apt/sources.list.bak-$(date +%Y%m%d)
> echo "deb https://notesalexp.org/tesseract-ocr5/$(lsb_release -cs)/ $(lsb_release -cs) main" | sudo tee -a /etc/apt/sources.list
> sudo apt-get update -oAcquire::AllowInsecureRepositories=true
> sudo apt-get install notesalexp-keyring -oAcquire::AllowInsecureRepositories=true
> sudo apt-get update
>
> sudo apt-get install tesseract-ocr
>
> Zdenko
>
>
> On Mon, 26 Dec 2022 at 23:20, 'Topas Topas' via tesseract-ocr <
> tesser...@googlegroups.com> wrote:
>
>> Hi,
>>
>> I would like to use tesseract 5.2.0 on my raspberry pi 4.
>>
>> I execute:
>>   
>> pi@raspberrypi:~ $ sudo apt update
>> pi@raspberrypi:~ $ sudo apt upgrade
>> pi@raspberrypi:~ $ sudo add-apt-repository ppa:alex-p/tesseract-ocr5
>> Traceback (most recent call last):
>>   File "/usr/bin/add-apt-repository", line 95, in 
>> sp = SoftwareProperties(options=options)
>>   File 
>> "/usr/lib/python3/dist-packages/softwareproperties/SoftwareProperties.py", 
>> line 109, in __init__
>> self.reload_sourceslist()
>>   File 
>> "/usr/lib/python3/dist-packages/softwareproperties/SoftwareProperties.py", 
>> line 599, in reload_sourceslist
>> self.distro.get_sources(self.sourceslist)
>>   File "/usr/lib/python3/dist-packages/aptsources/distro.py", line 93, in 
>> get_sources
>> (self.id, self.codename))
>> aptsources.distro.NoDistroTemplateException: Error: could not find a 
>> distribution template for Raspbian/buster
>>  
>> Regarding this, I have some questions:
>>
>>
>>1. Do I understand it correctly, that there is no distribution for 
>>this version of tesseract available and foreseen for raspi 4?
>>2. Whom could I probably ask to supply it?
>>3. If that is not likely to come: would it be easily possible for me 
>>to build it myself for raspi?
>>4. Or would it make more sense to install ubuntu on this hardware?
>>5. Or would it be the easiest to run this version of tesseract only 
>>on a non-ARM computer?
>>
>> Thanks in advance!

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/c8284fef-23fa-4e09-8938-a7d21542de82n%40googlegroups.com.


[tesseract-ocr] Tesseract API

2022-12-30 Thread Isha Patel
Can anyone help me use this API in Postman?

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/fb88634b-ed7e-4371-b632-7743df45b95dn%40googlegroups.com.


Re: [tesseract-ocr] tesseract 5.2.0 available on raspi4?

2022-12-27 Thread Zdenko Podobny
Hello,

Those commands are, IMO, for Ubuntu, and AFAIK Raspbian is based on
Debian... So the error saying that Raspbian/buster is not supported is
correct.

Try these steps/commands for Raspberry:

sudo apt update
sudo apt install apt-transport-https

sudo cp /etc/apt/sources.list /etc/apt/sources.list.bak-$(date +%Y%m%d)
echo "deb https://notesalexp.org/tesseract-ocr5/$(lsb_release -cs)/ $(lsb_release -cs) main" | sudo tee -a /etc/apt/sources.list
sudo apt-get update -oAcquire::AllowInsecureRepositories=true
sudo apt-get install notesalexp-keyring -oAcquire::AllowInsecureRepositories=true
sudo apt-get update

sudo apt-get install tesseract-ocr
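
The echo step above has to produce a single "deb ..." line, with the release codename appearing both in the URL path and as the suite name. As a rough illustration only, the line it writes can be sketched in Python; the function name is hypothetical, and the URL layout is simply copied from the command above rather than checked against the repository.

```python
def notesalexp_source_line(codename: str) -> str:
    """Build the one-line APT source entry used in the steps above.

    `codename` is what `lsb_release -cs` prints (e.g. "buster").
    The URL layout is taken from the echo command in this message.
    """
    base = "https://notesalexp.org/tesseract-ocr5"
    # The entry must stay on ONE line: "deb <url> <suite> <component>".
    return f"deb {base}/{codename}/ {codename} main"


if __name__ == "__main__":
    print(notesalexp_source_line("buster"))
```

If the entry ends up split over two lines in /etc/apt/sources.list, apt will most likely reject it, so it is worth checking the file after running the echo command.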

Zdenko


On Mon, 26 Dec 2022 at 23:20, 'Topas Topas' via tesseract-ocr <
tesseract-ocr@googlegroups.com> wrote:

> Hi,
>
> I would like to use tesseract 5.2.0 on my raspberry pi 4.
>
> I execute:
>
> pi@raspberrypi:~ $ sudo apt update
> pi@raspberrypi:~ $ sudo apt upgrade
> pi@raspberrypi:~ $ sudo add-apt-repository ppa:alex-p/tesseract-ocr5
> Traceback (most recent call last):
>   File "/usr/bin/add-apt-repository", line 95, in 
> sp = SoftwareProperties(options=options)
>   File
> "/usr/lib/python3/dist-packages/softwareproperties/SoftwareProperties.py",
> line 109, in __init__
> self.reload_sourceslist()
>   File
> "/usr/lib/python3/dist-packages/softwareproperties/SoftwareProperties.py",
> line 599, in reload_sourceslist
> self.distro.get_sources(self.sourceslist)
>   File "/usr/lib/python3/dist-packages/aptsources/distro.py", line 93, in
> get_sources
> (self.id, self.codename))
> aptsources.distro.NoDistroTemplateException: Error: could not find a
> distribution template for Raspbian/buster
>
> Regarding this, I have some questions:
>
>
>1. Do I understand it correctly, that there is no distribution for
>this version of tesseract available and foreseen for raspi 4?
>2. Whom could I probably ask to supply it?
>3. If that is not likely to come: would it be easily possible for me
>to build it myself for raspi?
>4. Or would it make more sense to install ubuntu on this hardware?
>5. Or would it be the easiest to run this version of tesseract only on
>a non-ARM computer?
>
> Thanks in advance!

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAJbzG8x1ubYMWa%3Dpr%3DmvyW0MK-PjprCeFv1ZuyWqOWc1VjEtug%40mail.gmail.com.


[tesseract-ocr] tesseract 5.2.0 available on raspi4?

2022-12-26 Thread 'Topas Topas' via tesseract-ocr
Hi,

I would like to use tesseract 5.2.0 on my raspberry pi 4.

I execute:
  
pi@raspberrypi:~ $ sudo apt update
pi@raspberrypi:~ $ sudo apt upgrade
pi@raspberrypi:~ $ sudo add-apt-repository ppa:alex-p/tesseract-ocr5
Traceback (most recent call last):
  File "/usr/bin/add-apt-repository", line 95, in <module>
    sp = SoftwareProperties(options=options)
  File "/usr/lib/python3/dist-packages/softwareproperties/SoftwareProperties.py", line 109, in __init__
    self.reload_sourceslist()
  File "/usr/lib/python3/dist-packages/softwareproperties/SoftwareProperties.py", line 599, in reload_sourceslist
    self.distro.get_sources(self.sourceslist)
  File "/usr/lib/python3/dist-packages/aptsources/distro.py", line 93, in get_sources
    (self.id, self.codename))
aptsources.distro.NoDistroTemplateException: Error: could not find a distribution template for Raspbian/buster
 
Regarding this, I have some questions:


   1. Do I understand it correctly, that there is no distribution for this 
   version of tesseract available and foreseen for raspi 4?
   2. Whom could I probably ask to supply it?
   3. If that is not likely to come: would it be easily possible for me to 
   build it myself for raspi?
   4. Or would it make more sense to install ubuntu on this hardware?
   5. Or would it be the easiest to run this version of tesseract only on a 
   non-ARM computer?

Thanks in advance!

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/c8e450ff-e480-47f7-abdf-12b378626d1cn%40googlegroups.com.


[tesseract-ocr] Fwd: [tesseract-ocr/tesseract] Release 5.3.0 - 5.3.0

2022-12-23 Thread Zdenko Podobny
Thank you, Stefan and other contributors for this new release!

Zdenko


---------- Forwarded message ---------
From: Stefan Weil
Date: Thu, 22 Dec 2022 at 15:18
Subject: [tesseract-ocr/tesseract] Release 5.3.0 - 5.3.0
To: tesseract-ocr/tesseract 
Cc: Subscribed 


5.3.0 <https://github.com/tesseract-ocr/tesseract/releases/tag/5.3.0>

Repository: tesseract-ocr/tesseract
<https://github.com/tesseract-ocr/tesseract> · Tag: 5.3.0
<https://github.com/tesseract-ocr/tesseract/tree/5.3.0> · Commit: 080da83
<https://github.com/tesseract-ocr/tesseract/commit/080da83cc51c4ef8b324a7e03146fe0bd7e0944b>
· Released by: stweil <https://github.com/stweil>
What's Changed

   - Fix memory issues in ScrollView::MessageReceiver by @p12tic
   <https://github.com/p12tic> in #3872
   <https://github.com/tesseract-ocr/tesseract/pull/3872>
   - autotools: Add rule for svpaint executable by @stweil
   <https://github.com/stweil> in #3873
   <https://github.com/tesseract-ocr/tesseract/pull/3873>
   - Replace call of exit function by return statement in main function by
   @stweil <https://github.com/stweil> in #3878
   <https://github.com/tesseract-ocr/tesseract/pull/3878>
   - Fix the build on CodeQL/Analyze by @arseniy-sonar
   <https://github.com/arseniy-sonar> in #3888
   <https://github.com/tesseract-ocr/tesseract/pull/3888>
   - CI: Remove Ubuntu 18.04 by @amitdo <https://github.com/amitdo> in #3902
   <https://github.com/tesseract-ocr/tesseract/pull/3902>
   - configure.ac: fix build on aarch64_be by @ffontaine
   <https://github.com/ffontaine> in #3907
   <https://github.com/tesseract-ocr/tesseract/pull/3907>
   - SW CI: Add paths filter by @amitdo <https://github.com/amitdo> in #3908
   <https://github.com/tesseract-ocr/tesseract/pull/3908>
   - Create .mailmap by @amitdo <https://github.com/amitdo> in #3910
   <https://github.com/tesseract-ocr/tesseract/pull/3910>
   - Fix tesseract.pc from cmake to match autotools by @jeroen
   <https://github.com/jeroen> in #3930
   <https://github.com/tesseract-ocr/tesseract/pull/3930>
   - Update README.md by @nicholasz2510 <https://github.com/nicholasz2510>
   in #3935 <https://github.com/tesseract-ocr/tesseract/pull/3935>
   - Fixed 2 errors by @Gitoffthelawn <https://github.com/Gitoffthelawn> in
   #3938 <https://github.com/tesseract-ocr/tesseract/pull/3938>
   - fix issue #3940
   <https://github.com/tesseract-ocr/tesseract/issues/3940> - remove
   colormap before thresholding by @zdenop <https://github.com/zdenop> in
   #3942 <https://github.com/tesseract-ocr/tesseract/pull/3942>
   - Update upload-artifact action by @rettinghaus
   <https://github.com/rettinghaus> in #3949
   <https://github.com/tesseract-ocr/tesseract/pull/3949>
   - Update checkout action to version 3 by @rettinghaus
   <https://github.com/rettinghaus> in #3948
   <https://github.com/tesseract-ocr/tesseract/pull/3948>
   - Fix Markdownlint by @Saibamen <https://github.com/Saibamen> in #3950
   <https://github.com/tesseract-ocr/tesseract/pull/3950>
   - Fix broken links in CONTRIBUTING.md by @doraeric
   <https://github.com/doraeric> in #3951
   <https://github.com/tesseract-ocr/tesseract/pull/3951>
   - pdfrenderer.cpp: Ignore non-text blocks by @amitdo
   <https://github.com/amitdo> in #3959
   <https://github.com/tesseract-ocr/tesseract/pull/3959>
   - lstm.train: allow .box from .raw.png too by @bertsky
   <https://github.com/bertsky> in #3962
   <https://github.com/tesseract-ocr/tesseract/pull/3962>
   - Fix a number of performance issues (reported by Coverity Scan) by
   @stweil <https://github.com/stweil> in #3967
   <https://github.com/tesseract-ocr/tesseract/pull/3967>
   - Fix training tools for legacy engine (issue #3925
   <https://github.com/tesseract-ocr/tesseract/issues/3925>) by @stweil
   <https://github.com/stweil> in #3970
   <https://github.com/tesseract-ocr/tesseract/pull/3970>
   - Fix function tesseract::WriteFeature (issue #3925
   <https://github.com/tesseract-ocr/tesseract/issues/3925>) by @stweil
   <https://github.com/stweil> in #3972
   <https://github.com/tesseract-ocr/tesseract/pull/3972>
   - Modernize function ObjectCache::DeleteUnusedObjects (fix issue with s…
   by @stweil <https://github.com/stweil> in #3978
   <https://github.com/tesseract-ocr/tesseract/pull/3978>
   - More fixes for issue #3925
   <https://github.com/tesseract-ocr/tesseract/issues/3925> by @stweil
   <https://github.com/stweil> in #3977
   <https://github.com/tesseract-ocr/tesseract/pull/3977>

New Contributors

   - @p12tic <https://github.com/p12tic> made their first contribution in
   #3872 <https://github.com/tesseract-ocr/tesseract/pull/3872>
   - @arseniy-sona

[tesseract-ocr] Tesseract assigns wrong font size

2022-12-16 Thread Kehinde Adeoya
I'm using Tesseract-3.0.5 and Tessdata-3.0.4.
I have trained the fonts successfully, and Tesseract recognizes their
properties. I have two fonts trained: Ubuntu and Inter.
Tesseract assigns appropriate properties to the Ubuntu font but is sometimes
wrong when assigning font size to Inter: for text of size 16 it reports 31,
and I'm wondering why. Has anyone experienced this?
What can be done to ensure it assigns the right font size?

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/2f855984-6e39-41ef-9b8f-62f57cec2aeen%40googlegroups.com.


[tesseract-ocr] Tesseract improve prediction accuracy

2022-12-02 Thread Kehinde Adeoya
Environment
   
   - *Tesseract Version*: 5.0.1-1.5.7, Tessdata: 3.04, Langdata: 3.04
   - *Platform*: 21.5.0 Darwin Kernel Version 21.5.0: 
   root:xnu-8020.121.3~4/RELEASE_X86_64 x86_64 i386 Darwin

Current Behavior:

Tesseract is unable to differentiate between font weights. After training a
font, the project uses font weights ranging from 100 to 900. Is there a way
to get font-weight as an attribute? Tesseract only returns bold, so there is
no way to check the weights.

[Image: Passport]


Secondly, Tesseract seems unstable in its predictions. I have done everything
that has been recommended to improve accuracy, yet the predictions remain
inconsistent. The image above is a prime example: sometimes it is seen as
bold, which is correct; in the next run, it might be seen as a normal font.
The font-weight is 700, which corresponds to bold. I have run the same test
case more than 10 times, and the result could be bold=6, normalfont=4.
Expected Behavior:

It should be consistent in prediction and differentiate between 
font-weights.

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/bdc26147-d463-42ab-bc41-0c8de16654dan%40googlegroups.com.


[tesseract-ocr] tesseract 4.1.1 slow in aws instance centos7

2022-11-09 Thread James Lian
Hi all,

I have installed tesseract 4.1.1 on an AWS EC2 instance running CentOS 7.

We have noticed that it is slow compared to our personal laptop.

Is there anything I can do to make it run faster?

The EC2 instance has 4 vCPUs and 16 GB of memory.

tesseract 4.1.1

 leptonica-1.78.0

  libjpeg 6b (libjpeg-turbo 1.2.90) : libpng 1.5.13 : libtiff 4.0.3 : zlib 
1.2.7 : libwebp 0.3.0

 Found AVX512BW

 Found AVX512F

 Found AVX2

 Found AVX

 Found FMA

 Found SSE
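
One thing that may be worth checking on a small multi-core instance is OpenMP threading: OMP_THREAD_LIMIT is a standard OpenMP environment variable, and limiting it to 1 while running one tesseract process per page is a commonly tried setup. The sketch below only prepares the environment and command line; the helper names are hypothetical, and whether this helps on this particular EC2 instance is an untested assumption.

```python
import os


def tesseract_env(base=None, threads=1):
    """Copy an environment and cap the OpenMP thread count.

    OMP_THREAD_LIMIT is a standard OpenMP variable; it only has an
    effect if the tesseract build uses OpenMP (assumption).
    """
    env = dict(os.environ if base is None else base)
    env["OMP_THREAD_LIMIT"] = str(threads)
    return env


def tesseract_cmd(image, out_base, lang="eng"):
    """Plain CLI invocation: tesseract <image> <output base> -l <lang>."""
    return ["tesseract", image, out_base, "-l", lang]


if __name__ == "__main__":
    # To actually run it: subprocess.run(tesseract_cmd(...), env=tesseract_env())
    print(tesseract_cmd("page1.png", "page1"), tesseract_env({}))
```

Timing a few pages with and without the limit would show whether threading overhead is the cause of the slowdown.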



To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/a8e07344-a13a-494c-bd89-d4dfa34006f7n%40googlegroups.com.


Re: [tesseract-ocr] Tesseract character recognition and C++

2022-09-26 Thread Zdenko Podobny
If there were a magic function for improving accuracy that worked in every
case, we would have implemented it years ago.
Please read the relevant documentation, where we have collected best
practices for Tesseract.

Image preprocessing depends on the image, which you did not show (instead
you posted unformatted code) - please think before sending a post to
(any) forum.

Zdenko


On Sat, 24 Sep 2022 at 20:36, Fish Money wrote:

>
> 
>
> I read single characters with C++ and Tesseract/Leptonica. Below is the
> chunk of code I use.
>
> Is it possible to improve accuracy by using some special function when I
> read only one character? I already have an OpenCV Mat with the image
> cropped to fit the character.
>
> char *outText1;
> tesseract::TessBaseAPI *api1 = new tesseract::TessBaseAPI();
> if (api1->Init(NULL, "eng")) {
>     fprintf(stderr, "Could not initialize tesseract.\n");
>     exit(1);
> }
> api1->SetImage((uchar*)imgWarpCopy1.data, imgWarpCopy1.size().width,
>                imgWarpCopy1.size().height, imgWarpCopy1.channels(),
>                imgWarpCopy1.step1());
> outText1 = api1->GetUTF8Text();
> int temp1 = strlen(outText1);
> string_read = outText1[temp1-3];
> api1->End();
> delete api1;
> delete [] outText1;

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAJbzG8yKQ1TckvLS-ThdNTKP6H4ds3VnXuR6ZOOGNXDPrqx80w%40mail.gmail.com.


[tesseract-ocr] Tesseract character recognition and C++

2022-09-24 Thread Fish Money



I read single characters with C++ and Tesseract/Leptonica. Below is the
chunk of code I use.

Is it possible to improve accuracy by using some special function when I
read only one character? I already have an OpenCV Mat with the image
cropped to fit the character.

char *outText1;
tesseract::TessBaseAPI *api1 = new tesseract::TessBaseAPI();
if (api1->Init(NULL, "eng")) {
    fprintf(stderr, "Could not initialize tesseract.\n");
    exit(1);
}
api1->SetImage((uchar*)imgWarpCopy1.data, imgWarpCopy1.size().width,
               imgWarpCopy1.size().height, imgWarpCopy1.channels(),
               imgWarpCopy1.step1());
outText1 = api1->GetUTF8Text();
int temp1 = strlen(outText1);
string_read = outText1[temp1-3];
api1->End();
delete api1;
delete [] outText1;
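
For a crop that holds exactly one character, Tesseract has a page segmentation mode for that case: --psm 10 on the command line, or SetPageSegMode(tesseract::PSM_SINGLE_CHAR) in the C++ API. The Python sketch below only builds the equivalent command line and shows a less fragile way to pick the character out of the result than indexing from the end of the string. The helper names and the optional whitelist are illustrative assumptions, not part of the code above, and none of this guarantees better accuracy on a given image.

```python
def single_char_cmd(image_path, whitelist=None, lang="eng"):
    """Command line for recognising one character and printing it to stdout."""
    cmd = ["tesseract", image_path, "stdout", "-l", lang, "--psm", "10"]
    if whitelist:
        # Restrict the candidate characters, e.g. digits only (optional).
        cmd += ["-c", f"tessedit_char_whitelist={whitelist}"]
    return cmd


def first_char(ocr_text):
    """Return the first non-whitespace character of the OCR output, or ''."""
    stripped = "".join(ocr_text.split())
    return stripped[0] if stripped else ""


if __name__ == "__main__":
    print(single_char_cmd("char.png", whitelist="0123456789"))
    print(repr(first_char("7\n\x0c")))
```

Stripping whitespace first avoids depending on how many trailing newline or form-feed characters the output happens to contain.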

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/8086612a-cc2d-4549-ab6b-71235dcb67aen%40googlegroups.com.


[tesseract-ocr] tesseract with user-words

2022-09-20 Thread Justin Seabrook
I know there are some similar posts - I've read them all! - but they don't 
seem to provide an answer.  I'm in  Windows 11 with Tesseract 
5.2.0.20220712.

I was having trouble applying a user word list instead of the dawg list, so
I made a very simple example: an image with one word that is not correctly
detected, plus a user-words file with a single entry that is a close match.

So, here's the image, temp.png, which is a slightly blurred image of 
"testW0rd", and using this command:
"C:\Program Files\Tesseract-OCR\tesseract" temp.png output --psm 3
I get the result "testwurd" in output.txt.

OK, so following the instructions, when I now put a file called
eng.user-words with one entry - "testWord" - in C:\Program
Files\Tesseract-OCR\tessdata and a text file called bazaar in C:\Program
Files\Tesseract-OCR\tessdata\configs with the following lines:
load_system_dawg     F
load_freq_dawg       F
user_words_suffix    user-words
language_model_penalty_non_dict_word 1

And run again, I get the same result as before: "testwurd".  It doesn't 
seem to be using the user-words file?  Or rather since it errors if it's 
not there, it is accessing it but possibly not doing anything with it?

Any ideas why this is not working, would really appreciate some help with 
this from an expert.
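
One detail that may be worth double-checking: a config file in tessdata\configs is only read when its name is given as a trailing argument on the command line, and the command quoted above does not show "bazaar" there. The tesseract CLI also accepts --user-words PATH to point at a word list directly. The sketch below just assembles both variants of the command; the function names are hypothetical, and whether the default LSTM engine actually consults the word list is a separate question this sketch does not settle.

```python
def cmd_with_config(image, out_base, config_name, psm=3):
    """tesseract <image> <out> --psm N <config>: config names go last."""
    return ["tesseract", image, out_base, "--psm", str(psm), config_name]


def cmd_with_user_words(image, out_base, words_path, psm=3):
    """Same call, but passing the word list explicitly via --user-words."""
    return ["tesseract", image, out_base, "--psm", str(psm),
            "--user-words", words_path]


if __name__ == "__main__":
    print(cmd_with_config("temp.png", "output", "bazaar"))
    print(cmd_with_user_words("temp.png", "output", "eng.user-words"))
```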

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/b9450ec9-f943-40dd-8948-c2071e0f96f1n%40googlegroups.com.


eng.user-words
Description: Binary data


bazaar
Description: Binary data


[tesseract-ocr] Tesseract on macOS Catalyst

2022-09-07 Thread Gopinath
Hi,

Is the Tesseract OCR solution supported in macOS Catalyst apps? My app works
with both iOS and macOS Catalyst. I would like to know whether it can be
supported on both platforms.

Regards,

Gopinath

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/65d0d57b-41f6-4bc8-99bb-08c8dd456c9bn%40googlegroups.com.


[tesseract-ocr] Tesseract unstable font property prediction

2022-09-02 Thread Kehinde Adeoya
Tesseract 3.0.5
TessData 3.0.4
Tesseract 5Java binding.

I am using Tesseract 3.0.5 in a project, which is awesome. It works 
brilliantly well. Lately, I noticed its predictability changes when the 
same code is run multiple times for the same image text. I was able to 
train new fonts in different languages. An example is this: when I run to 
get the font properties of an image, I'm getting these properties: 
font-name, bold, italic, monospace, serif, and underline. I ran it multiple 
times on the same image text, and it produces different results for the 
same image text.

The text on the image should return this result: 
Ubuntu, FALSE, FALSE, FALSE, FALSE, FALSE, PASS, but subsequent runs 
produce different results for the same text on the same image. 

Runs         Font name      Bold   Italic  Monospace  Serif  Underline  Result
First run:   Ubuntu         FALSE  FALSE   FALSE      FALSE  FALSE      PASS
Second run:  Ubuntu-Italic  FALSE  TRUE    FALSE      FALSE  FALSE      FAIL
Third run:   Ubuntu-Bold    TRUE   FALSE   FALSE      FALSE  FALSE      FAIL
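
To put a number on how unstable the result is, the per-run outputs can be tallied. This is only a small bookkeeping sketch in Python using the three font names reported above; the function names are made up for illustration, and it does not change how Tesseract itself behaves.

```python
from collections import Counter

# Font names reported on three runs over the same image (from the runs above).
runs = ["Ubuntu", "Ubuntu-Italic", "Ubuntu-Bold"]


def tally(results):
    """Count how often each distinct result was returned."""
    return Counter(results)


def agreement(results):
    """Fraction of runs that returned the most common result."""
    counts = Counter(results)
    return counts.most_common(1)[0][1] / len(results)


if __name__ == "__main__":
    print(tally(runs))
    print(agreement(runs))
```

Logging this over more runs would make it easier to tell whether a settings change actually reduces the variation.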

Are there settings to make it more resilient and specific than changing it 
at every new run?


To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/0e49822e-7bde-476f-9fcb-168bad859698n%40googlegroups.com.


Re: [tesseract-ocr] Tesseract 3.04 error.

2022-08-19 Thread S J
Hi sir,

I am facing this error. Please help.



On Thursday, 17 September 2015 at 15:45:58 UTC+5:30 Nick White wrote:

> On Wed, Sep 16, 2015 at 10:16:40PM +0530, ShreeDevi Kumar wrote:
> > If you are having trouble using it with Java, Quan maybe able to suggest 
> a
> > solution.
>
> I agree, this sounds more like a Java issue to me. I don't know Java 
> at all, but if it's treating anything that sends output to stderr as 
> failing that should be something you can easily fix by changing the 
> behaviour of your java code.
>
> Certainly compiling an older version of Tesseract (which, as Zdenko 
> says, has significantly worse OS X support) is not the correct way 
> to go.
>
> Nick
>

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/162d808e-9010-485f-968f-2cb40ea68605n%40googlegroups.com.


Re: [tesseract-ocr] Tesseract OCR on PDF without converting into images

2022-08-12 Thread Merlijn B.W. Wajer

Hi Banti,

On 11/08/2022 12:11, Banti Kumar wrote:

Can I use tesseract on pdf without converting pages into images?
I have some pdf pages with digital text and Images with text, I just 
want to apply ocr on images but not on the digital text regions so I can 
get better accuracy for searchable pdfs


I've been working on something similar to this, but it's not ready for 
doing exactly what you want. Basically, I have a tool to convert the 
text layers of a PDF to hOCR, one of the output formats from Tesseract. 
If you run that, and then also OCR the entire PDF with Tesseract, you 
could try to "merge" the two hOCR files into one, preferring the 
extracted text over the Tesseract text if they overlap - or do it based 
on word confidence or so.
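
The merge idea described here could look roughly like the Python sketch below: keep every word extracted from the PDF text layer, and add an OCR word only if it does not overlap any extracted word. The Word type and function names are hypothetical, both word lists are assumed to already be in the same pixel coordinate system, and this is not the behaviour of any existing tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    bbox: tuple   # (x0, y0, x1, y1) in image pixels, top-left origin
    source: str   # "pdf" for extracted text, "ocr" for Tesseract output


def overlaps(a, b):
    """True if two (x0, y0, x1, y1) boxes share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def merge_words(pdf_words, ocr_words):
    """Prefer extracted PDF text; keep OCR words only where nothing overlaps."""
    merged = list(pdf_words)
    for w in ocr_words:
        if not any(overlaps(w.bbox, p.bbox) for p in pdf_words):
            merged.append(w)
    return merged


if __name__ == "__main__":
    pdf = [Word("Hello", (0, 0, 50, 10), "pdf")]
    ocr = [Word("He1lo", (1, 0, 49, 10), "ocr"), Word("scan", (0, 100, 40, 110), "ocr")]
    print([w.text for w in merge_words(pdf, ocr)])
```

A confidence-based variant would compare the OCR word confidence against a threshold instead of always preferring the extracted text.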


Of course, you'll have to figure out a proper scale, since Tesseract 
requires the PDF to be rendered to an image, and the image pixels need 
to line up with the hOCR coordinates extracted from the PDF.
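
As a worked example of the scaling step: PDF user space is measured in points (1/72 inch) with the origin at the bottom-left, while hOCR bounding boxes are in image pixels with the origin at the top-left. Assuming the extracted boxes are in plain PDF points (an assumption; the tool mentioned in this message may already use a different convention), the conversion for a page rendered at a given DPI can be sketched in Python like this. The function name is hypothetical.

```python
def pdf_bbox_to_pixels(bbox_pt, page_height_pt, dpi):
    """Convert a PDF-space box to pixel coordinates of the rendered page.

    bbox_pt: (x0, y0, x1, y1) in points, origin at bottom-left.
    Returns (left, top, right, bottom) in pixels, origin at top-left.
    """
    x0, y0, x1, y1 = bbox_pt
    # 72 points per inch, dpi pixels per inch.
    to_px = lambda v: round(v * dpi / 72)
    # Flip the y axis: the PDF top edge (y1) becomes the smaller pixel y.
    return (to_px(x0), to_px(page_height_pt - y1),
            to_px(x1), to_px(page_height_pt - y0))


if __name__ == "__main__":
    # US Letter page (612 x 792 pt) rendered at 300 DPI.
    print(pdf_bbox_to_pixels((72, 720, 144, 756), 792, 300))
```

Rendering the page at the same DPI that is used in this conversion is what makes the two hOCR coordinate systems line up.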


You can find the tool here (but keep in mind I'm still actively working 
on it / breaking things): 
https://github.com/internetarchive/archive-hocr-tools/blob/master/bin/pdf-to-hocr


I don't (yet) have a tool to merge hOCR files.

Regards,
Merlijn

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/1bd1d476-db6c-e521-cb97-c7a404a5f182%40archive.org.


Re: [tesseract-ocr] Tesseract OCR on PDF without converting into images

2022-08-12 Thread Zdenko Podobny
No.

On Thu, 11 Aug 2022, 12:11 Banti Kumar,  wrote:

> Can I use tesseract on pdf without converting pages into images?
> I have some pdf pages with digital text and Images with text, I just want
> to apply ocr on images but not on the digital text regions so I can get
> better accuracy for searchable pdfs
>
> TIA

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAJbzG8wmNwYgQ_2MqhFu_0dM4dcjwsWphzqtRPpsq33KGg7c%2Bw%40mail.gmail.com.


[tesseract-ocr] Tesseract OCR on PDF without converting into images

2022-08-11 Thread Banti Kumar
Can I use Tesseract on a PDF without converting its pages into images?
I have some PDF pages that contain both digital text and images with text.
I just want to apply OCR to the images, not to the digital text regions, so
that I get better accuracy for searchable PDFs.

TIA



[tesseract-ocr] Tesseract has any ECCN(Export Control Certification Number)

2022-07-25 Thread Saddam Quraishi
Hi,

Thank you for your support.
Kindly share the following information:
Does Tesseract have an ECCN (Export Control Classification Number), or does
it not require one because it is open source?
A quick reply would be helpful, as we are planning to use it in our
product and release it.
Thanks in advance.

Regards,
Saddam



[tesseract-ocr] tesseract 5.2.0 Release

2022-07-10 Thread Zdenko Podobny
Hello all,

I am proud to announce a new release of the Tesseract OCR engine, version 5.2.0:

   - Improvements and fixes for continuous integration, autoconf and cmake
   builds.
   - Set /Os for some 32 bit MS compilers (fixes #3769).
   - Improve comments and other documentation.
   - Add initial support for Intel AVX512F.
   - Fix for very large PDF files on 32 bit hosts (fixes #3805).
   - Fix NEON detection on FreeBSD.
   - Fix regression with UZN files (fixes #3837).
   - Fix calling delete[] for memory allocated by malloc in C API.
   - Add an API function to init tesseract with traineddata from memory
   (fixes #3691).
   - Replace direct access to Leptonica internal data structures by
   function calls and support latest releases of Leptonica.
   - Replace std::regex by std::string functions (fixes issue #3830).
   - Use compiled-in TESSDATA_PREFIX also on Windows (fixes #3767).
   - Add new parameter 'invert_threshold', change the default threshold
   from 0.5 to 0.7 and mark parameter 'tessedit_do_invert' as deprecated.
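
For anyone who wants to experiment with the new parameter from Python, here
is a minimal sketch. It relies on the fact that Tesseract accepts
configuration variables on the command line as "-c name=value"; the helper
name invert_config is hypothetical, and only the parameter name
invert_threshold and the value 0.7 come from the notes above.

```python
def invert_config(threshold: float = 0.7) -> str:
    """Build a Tesseract config fragment that sets invert_threshold.

    The helper name is illustrative; the default mirrors the 5.2.0 notes.
    """
    return f"-c invert_threshold={threshold}"


print(invert_config())     # -c invert_threshold=0.7
print(invert_config(0.5))  # -c invert_threshold=0.5
```

The resulting string could be appended to a tesseract command line or passed
as the config argument of pytesseract.image_to_string.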

Source code can be downloaded from GitHub [1].

[1] https://github.com/tesseract-ocr/tesseract/releases/tag/5.2.0
<https://github.com/tesseract-ocr/tesseract/wiki/ReleaseNotes>



Zdenko



Re: [tesseract-ocr] Tesseract OCR LCD digits doesn't work

2022-06-27 Thread Hervé
The decimal point is not a problem: I can divide by 100 or 10 and it works :)

Could you share the whole code with me? Thanks.
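
The divide-by-100 workaround can be sketched in a few lines. This is only an
illustration: the function name restore_decimals is hypothetical, and it
assumes the number of decimal places on the display is known in advance.

```python
from typing import List


def restore_decimals(ocr_text: str, decimals: int = 2) -> List[float]:
    """Split OCR output on whitespace and shift the decimal point left.

    Tokens that are not plain digit runs are skipped rather than guessed at.
    """
    values = []
    for token in ocr_text.split():
        if token.isdigit():
            values.append(int(token) / 10 ** decimals)
    return values


print(restore_decimals("733 124"))          # [7.33, 1.24]
print(restore_decimals("733", decimals=1))  # [73.3]
```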

On Monday, 27 June 2022 at 20:44:42 UTC+2, zdenop wrote:

> not sure what are you doing, but try something like this:
>
> def autoinvert(binarized_img, tresh=0.5):
>     """Invert binarized image if amount of black pixels is higher than
>     tresh.
>     """
>     height, width = binarized_img.shape
>     non_zero = cv2.countNonZero(binarized_img)
>     white_rate = non_zero/(height*width)
>     if white_rate < tresh:
>         return ~binarized_img
>     else:
>         return binarized_img
>
> filename = 'default.png'
> test = cv2.imread(filename, cv2.IMREAD_GRAYSCALE)
> binarized = cv2.threshold(test, 0, 255, cv2.THRESH_BINARY +
>                           cv2.THRESH_OTSU)[1]
> kernel = np.ones((5,5), np.uint8)
> img_erosion = cv2.dilate(autoinvert(binarized), kernel, iterations=1)
> ratio = round(40/img_erosion.shape[0], 2)
> ocr_image = cv2.resize(img_erosion, (0,0), fx=ratio, fy=ratio)
>
> output = pytesseract.image_to_string(ocr_image,
>     config=f'--tessdata-dir "{tessdata}" --psm 6')
> print(output)
>
> Which produces '733 124', so there is still a problem with the decimal 
> point...
>
> Zdenko
>
>
> po 27. 6. 2022 o 13:00 Hervé  napísal(a):
>
>> Hi
>>
>> I don't achieve to have a 300dpi image, I tried with increasing picam 
>> resolution, I only have 96. I tried with 
>>
>> img = cv2.resize(img, None, fx=1.5, fy=1.5, interpolation=cv2.INTER_AREA) 
>>
>> but it only grows the image size, not the DPI.
>>
>> Thanks
>>
>>
>> Le dimanche 26 juin 2022 à 15:24:01 UTC+2, zdenop a écrit :
>>
>>> Check your tesseract version (tesseract -v). Here is mine:
>>>
>>> tesseract 5.1.0-70-g0df5
>>>  leptonica-1.83.0 (Jun 24 2022, 17:48:50) [MSC v.1929 LIB Release x64]
>>>   libgif 5.2.1 : libjpeg 6b (libjpeg-turbo 2.0.91) : libpng 1.6.37 : 
>>> libtiff 4.4.0 : zlib 1.2.12 : libwebp 1.2.2 : libopenjp2 2.5.0
>>>  Found AVX2
>>>  Found AVX
>>>  Found FMA
>>>  Found SSE4.1
>>>  Found libarchive 3.5.1 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.6 
>>> libzstd/1.4.9
>>>  Found libcurl/7.75.0 zlib/1.2.12 libssh2/1.10.1_DEV
>>>
>>>
>>> + try to use (eng) data file from tessdata_best[1] (also just 
>>> tessdata[2] produce a result)
>>>
>>> Regarding image: 
>>>
>>>1. I took output from your code "cv2.imwrite('pH.jpg', ph)" (jpg is 
>>>not good format for ocr)
>>>2. I opened it as grayscale and I see 2 problems covered by 
>>>documentation:
>>>   - it needs to be inverted
>>>   - it needs to be resized to the height of letters is between 
>>>   30-40 points.
>>>3. I guess sharpening (to increase space between dot and 3) 
>>>would help to recognize dot.
>>>4. Binarize/threshold image by yourself. Tesseract has some binarize 
>>>algorithms, but you can another one that better fit your case.
>>>
>>> I suggest doing image preprocessing in the image editor (to check what 
>>> helps) and then implementing it into code.
>>>
>>> [1] https://github.com/tesseract-ocr/tessdata_best
>>> [2] https://github.com/tesseract-ocr/tessdata
>>>
>>> Zdenko
>>>
>>>
>>> ne 26. 6. 2022 o 0:23 Hervé  napísal(a):
>>>
 Sorry I am really noob

 When I do : tesseract pH_treshr.png -
 I have :
 Empty page!!
 Empty page!!

 How do you achieve to have this image ? and why can't I tesseract it 
 like you ? I am on buster with tesseract 5.1

 is there a way to discuss ? discord ? 

 thanks for your patience and help

 Le samedi 25 juin 2022 à 14:34:06 UTC+2, zdenop a écrit :

> Sorry - I mean Rescaling:
>
> Tesseract works best on images which have a DPI of at least 300 dpi, 
> so it may be beneficial to resize images. For more information see the 
> FAQ.
> "Willus Dotkom" made interesting test for Optimal image resolution 
> with suggestion for optimal Height of capital letter in pixels:
> https://groups.google.com/g/tesseract-ocr/c/Wdh_JJwnw94/m/24JHDYQbBQAJ
>
>
> After that, you can get output (but the dot is missing) with the 
> command line: "tesseract pH_treshr.png -"
>
> I was able to get the decimal point separator with the letsgodigital 
> data file 
> https://github.com/arturaugusto/display_ocr/blob/master/letsgodigital/letsgodigital.traineddata
> tesseract pH_treshr.png - -l letsgodigital
>
> Or  have a look at SSD https://github.com/Shreeshrii/tessdata_ssd
>
> Zdenko
>
>
> so 25. 6. 2022 o 12:17 Hervé  napísal(a):
>
>> I am on tesseract 5
>>
>> Inverting images 
>>
>> While tesseract version 3.05 (and older) handle inverted image (dark 
>> background and light text) without problem, for 4.x version use dark 
>> text 
>> on light background.
>> isn'it the same than : 
>> (thresh, im_bw) = cv2.threshold(gray, 128, 255, cv2.THRESH_BINARY 
>> | cv2.THRESH_OTSU)
>> im_bw = cv2.bitw

Re: [tesseract-ocr] Tesseract OCR LCD digits doesn't work

2022-06-27 Thread Zdenko Podobny
I'm not sure what you are doing, but try something like this:

    import cv2
    import numpy as np
    import pytesseract

    # Placeholder (assumption): directory that holds your traineddata files.
    tessdata = './tessdata'

    def autoinvert(binarized_img, tresh=0.5):
        """Invert a binarized image if the share of black pixels is higher
        than tresh.
        """
        height, width = binarized_img.shape
        non_zero = cv2.countNonZero(binarized_img)
        white_rate = non_zero / (height * width)
        if white_rate < tresh:
            return ~binarized_img
        else:
            return binarized_img

    filename = 'default.png'
    test = cv2.imread(filename, cv2.IMREAD_GRAYSCALE)
    # With THRESH_OTSU the threshold is chosen automatically; the 0 is ignored.
    binarized = cv2.threshold(test, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
    kernel = np.ones((5, 5), np.uint8)
    # Dilating the white background thins the dark digits.
    img_erosion = cv2.dilate(autoinvert(binarized), kernel, iterations=1)
    # Scale the image to a height of about 40 pixels.
    ratio = round(40 / img_erosion.shape[0], 2)
    ocr_image = cv2.resize(img_erosion, (0, 0), fx=ratio, fy=ratio)

    output = pytesseract.image_to_string(
        ocr_image, config=f'--tessdata-dir "{tessdata}" --psm 6')
    print(output)

This produces '733 124', so there is still a problem with the decimal
point...
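
To look at the inversion rule in isolation, here is a minimal pure-Python
sketch that works on a nested list of 0/255 values instead of an OpenCV
array. The names white_rate and autoinvert_list are hypothetical; the logic
mirrors the white-rate test in autoinvert above.

```python
from typing import List


def white_rate(pixels: List[List[int]]) -> float:
    """Fraction of non-zero (white) pixels in a binarized image."""
    total = sum(len(row) for row in pixels)
    non_zero = sum(1 for row in pixels for value in row if value != 0)
    return non_zero / total


def autoinvert_list(pixels: List[List[int]], tresh: float = 0.5) -> List[List[int]]:
    """Return the image inverted (255 - v) when white pixels are the minority."""
    if white_rate(pixels) < tresh:
        return [[255 - value for value in row] for row in pixels]
    return pixels


# Light "text" on a dark background: 2 of 12 pixels are white.
dark_background = [
    [0, 0, 0, 0],
    [0, 255, 255, 0],
    [0, 0, 0, 0],
]
print(autoinvert_list(dark_background)[1])  # [255, 0, 0, 255]
```

An image that is already mostly white is returned unchanged, which is why the
rule can be applied blindly to every crop.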

Zdenko


On Mon, 27 Jun 2022 at 13:00, Hervé wrote:

> Hi
>
> I don't achieve to have a 300dpi image, I tried with increasing picam
> resolution, I only have 96. I tried with
>
> img = cv2.resize(img, None, fx=1.5, fy=1.5, interpolation=cv2.INTER_AREA)
>
> but it only grows the image size, not the DPI.
>
> Thanks
>
>
> Le dimanche 26 juin 2022 à 15:24:01 UTC+2, zdenop a écrit :
>
>> Check your tesseract version (tesseract -v). Here is mine:
>>
>> tesseract 5.1.0-70-g0df5
>>  leptonica-1.83.0 (Jun 24 2022, 17:48:50) [MSC v.1929 LIB Release x64]
>>   libgif 5.2.1 : libjpeg 6b (libjpeg-turbo 2.0.91) : libpng 1.6.37 :
>> libtiff 4.4.0 : zlib 1.2.12 : libwebp 1.2.2 : libopenjp2 2.5.0
>>  Found AVX2
>>  Found AVX
>>  Found FMA
>>  Found SSE4.1
>>  Found libarchive 3.5.1 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.6
>> libzstd/1.4.9
>>  Found libcurl/7.75.0 zlib/1.2.12 libssh2/1.10.1_DEV
>>
>>
>> + try to use (eng) data file from tessdata_best[1] (also just tessdata[2]
>> produce a result)
>>
>> Regarding image:
>>
>>1. I took output from your code "cv2.imwrite('pH.jpg', ph)" (jpg is
>>not good format for ocr)
>>2. I opened it as grayscale and I see 2 problems covered by
>>documentation:
>>   - it needs to be inverted
>>   - it needs to be resized to the height of letters is between 30-40
>>   points.
>>3. I guess sharpening (to increase space between dot and 3)
>>would help to recognize dot.
>>4. Binarize/threshold image by yourself. Tesseract has some binarize
>>algorithms, but you can another one that better fit your case.
>>
>> I suggest doing image preprocessing in the image editor (to check what
>> helps) and then implementing it into code.
>>
>> [1] https://github.com/tesseract-ocr/tessdata_best
>> [2] https://github.com/tesseract-ocr/tessdata
>>
>> Zdenko
>>
>>
>> ne 26. 6. 2022 o 0:23 Hervé  napísal(a):
>>
>>> Sorry I am really noob
>>>
>>> When I do : tesseract pH_treshr.png -
>>> I have :
>>> Empty page!!
>>> Empty page!!
>>>
>>> How do you achieve to have this image ? and why can't I tesseract it
>>> like you ? I am on buster with tesseract 5.1
>>>
>>> is there a way to discuss ? discord ?
>>>
>>> thanks for your patience and help
>>>
>>> Le samedi 25 juin 2022 à 14:34:06 UTC+2, zdenop a écrit :
>>>
 Sorry - I mean Rescaling:

 Tesseract works best on images which have a DPI of at least 300 dpi, so
 it may be beneficial to resize images. For more information see the FAQ.
 "Willus Dotkom" made interesting test for Optimal image resolution with
 suggestion for optimal Height of capital letter in pixels:
 https://groups.google.com/g/tesseract-ocr/c/Wdh_JJwnw94/m/24JHDYQbBQAJ


 After that, you can get output (but the dot is missing) with the
 command line: "tesseract pH_treshr.png -"

 I was able to get the decimal point separator with the letsgodigital
 data file
 https://github.com/arturaugusto/display_ocr/blob/master/letsgodigital/letsgodigital.traineddata
 tesseract pH_treshr.png - -l letsgodigital

 Or  have a look at SSD https://github.com/Shreeshrii/tessdata_ssd

 Zdenko


 so 25. 6. 2022 o 12:17 Hervé  napísal(a):

> I am on tesseract 5
>
> Inverting images
>
> While tesseract version 3.05 (and older) handle inverted image (dark
> background and light text) without problem, for 4.x version use dark text
> on light background.
> isn'it the same than :
> (thresh, im_bw) = cv2.threshold(gray, 128, 255, cv2.THRESH_BINARY
> | cv2.THRESH_OTSU)
> im_bw = cv2.bitwise_not(im_bw)
>
> for resizing, I take my picture in full HD, do increasing resolution
> will allow tesseract to better OCR ?
>
> thanks
>
>
> Le samedi 25 juin 2022 à 11:25:50 UTC+2, zdenop a écrit :
>
>> Why you did not try more relevant hits like inverting and resizing?
>>
>> Zdenko
>>
>>
>> so 25. 6. 2022 o 10:56 Hervé  nap

Re: [tesseract-ocr] Tesseract OCR LCD digits doesn't work

2022-06-26 Thread Zdenko Podobny
Check your tesseract version (tesseract -v). Here is mine:

tesseract 5.1.0-70-g0df5
 leptonica-1.83.0 (Jun 24 2022, 17:48:50) [MSC v.1929 LIB Release x64]
  libgif 5.2.1 : libjpeg 6b (libjpeg-turbo 2.0.91) : libpng 1.6.37 :
libtiff 4.4.0 : zlib 1.2.12 : libwebp 1.2.2 : libopenjp2 2.5.0
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found libarchive 3.5.1 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.6 libzstd/1.4.9
 Found libcurl/7.75.0 zlib/1.2.12 libssh2/1.10.1_DEV


Also try the eng data file from tessdata_best [1] (plain tessdata [2] also
produces a result).

Regarding the image:

   1. I took the output from your code "cv2.imwrite('pH.jpg', ph)" (JPEG is
   not a good format for OCR).
   2. I opened it as grayscale and I see two problems covered by the
   documentation:
      - it needs to be inverted
      - it needs to be resized so that the height of the letters is between
      30 and 40 points.
   3. I guess sharpening (to increase the space between the dot and the 3)
   would help to recognize the dot.
   4. Binarize/threshold the image yourself. Tesseract has some binarization
   algorithms, but you can use another one that better fits your case.
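
The resizing advice in point 2 comes down to one division. A minimal sketch,
assuming the current letter height has been measured in pixels by hand; the
function names, the target of 35 (the midpoint of the 30-40 range), and the
letter height of 90 are illustrative assumptions. The 199x109 size is that of
the pH crop img[516:625, 616:815] from the original script.

```python
from typing import Tuple


def scale_for_letter_height(current_height: float, target_height: float = 35.0) -> float:
    """Factor that brings letters of current_height to target_height."""
    if current_height <= 0:
        raise ValueError("current_height must be positive")
    return target_height / current_height


def scaled_size(width: int, height: int, factor: float) -> Tuple[int, int]:
    """New (width, height) after applying the same factor to both axes."""
    return round(width * factor), round(height * factor)


factor = scale_for_letter_height(90)  # 90 px tall digits: an assumed measurement
print(round(factor, 2))               # 0.39
print(scaled_size(199, 109, factor))  # (77, 42)
```

The factor can then be passed as fx and fy to cv2.resize.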

I suggest trying the image preprocessing in an image editor first (to check
what helps) and then implementing it in code.

[1] https://github.com/tesseract-ocr/tessdata_best
[2] https://github.com/tesseract-ocr/tessdata

Zdenko


On Sun, 26 Jun 2022 at 0:23, Hervé wrote:

> Sorry I am really noob
>
> When I do : tesseract pH_treshr.png -
> I have :
> Empty page!!
> Empty page!!
>
> How do you achieve to have this image ? and why can't I tesseract it like
> you ? I am on buster with tesseract 5.1
>
> is there a way to discuss ? discord ?
>
> thanks for your patience and help
>
> Le samedi 25 juin 2022 à 14:34:06 UTC+2, zdenop a écrit :
>
>> Sorry - I mean Rescaling:
>>
>> Tesseract works best on images which have a DPI of at least 300 dpi, so
>> it may be beneficial to resize images. For more information see the FAQ.
>> "Willus Dotkom" made interesting test for Optimal image resolution with
>> suggestion for optimal Height of capital letter in pixels:
>> https://groups.google.com/g/tesseract-ocr/c/Wdh_JJwnw94/m/24JHDYQbBQAJ
>>
>>
>> After that, you can get output (but the dot is missing) with the command
>> line: "tesseract pH_treshr.png -"
>>
>> I was able to get the decimal point separator with the letsgodigital data
>> file
>> https://github.com/arturaugusto/display_ocr/blob/master/letsgodigital/letsgodigital.traineddata
>> tesseract pH_treshr.png - -l letsgodigital
>>
>> Or  have a look at SSD https://github.com/Shreeshrii/tessdata_ssd
>>
>> Zdenko
>>
>>
>> so 25. 6. 2022 o 12:17 Hervé  napísal(a):
>>
>>> I am on tesseract 5
>>>
>>> Inverting images
>>>
>>> While tesseract version 3.05 (and older) handle inverted image (dark
>>> background and light text) without problem, for 4.x version use dark text
>>> on light background.
>>> isn'it the same than :
>>> (thresh, im_bw) = cv2.threshold(gray, 128, 255, cv2.THRESH_BINARY |
>>> cv2.THRESH_OTSU)
>>> im_bw = cv2.bitwise_not(im_bw)
>>>
>>> for resizing, I take my picture in full HD, do increasing resolution
>>> will allow tesseract to better OCR ?
>>>
>>> thanks
>>>
>>>
>>> Le samedi 25 juin 2022 à 11:25:50 UTC+2, zdenop a écrit :
>>>
 Why you did not try more relevant hits like inverting and resizing?

 Zdenko


 so 25. 6. 2022 o 10:56 Hervé  napísal(a):

> I tried gray image, black and white, and I use
>
>  custom_psm = r'--psm 7'
>
> didn't try others parameters
> Le samedi 25 juin 2022 à 10:32:14 UTC+2, zdenop a écrit :
>
>>
>>
>> so 25. 6. 2022 o 8:15 Hervé  napísal(a):
>>
>>> Hi
>>> I just tried some, without real success
>>>
>>> Please be specific: what did you try and what was the result?
>>
>>
>>
>>> could I learn digits from pictures ? maybe this font is not well
>>> recognized
>>>
>>
>> Any training is useless if the failure is at the image preprocessing
>> stage.
>>
>>
>>> thanks
>>>
>>> Le vendredi 24 juin 2022 à 17:12:44 UTC+2, zdenop a écrit :
>>>
 Did try to implement suggestion from documentation?
 https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md


 Zdenko


 pi 24. 6. 2022 o 16:59 Hervé  napísal(a):

> Hi, I need some help to make tesseract-OCR recognize digits :
> can't achieve to make this work with
>
>
> https://img.super-h.fr/images/2022/06/24/9a03414616bc4c6bd6e4bdb78e9d6783.jpg
>
> here is my code :
>
>
>
> import cv2
> import pytesseract
>
> pytesseract.pytesseract.tesseract_cmd ="C:\\Program
> Files\\Tesseract-OCR\\tesseract.exe"
>
> def process_image(img):
> #cv2.imshow('Img',img)
> #cv2.waitKey(0)
>
> ### passage en niveau de gr

Re: [tesseract-ocr] Tesseract OCR LCD digits doesn't work

2022-06-25 Thread Hervé
Sorry, I am a real beginner.

When I run: tesseract pH_treshr.png -
I get:
Empty page!!
Empty page!!

How did you obtain this image, and why can't I run Tesseract on it like
you did? I am on Buster with Tesseract 5.1.

Is there a way to discuss this, for example on Discord?

Thanks for your patience and help.

On Saturday, 25 June 2022 at 14:34:06 UTC+2, zdenop wrote:

> Sorry - I mean Rescaling:
>
> Tesseract works best on images which have a DPI of at least 300 dpi, so it 
> may be beneficial to resize images. For more information see the FAQ.
> "Willus Dotkom" made interesting test for Optimal image resolution with 
> suggestion for optimal Height of capital letter in pixels:
> https://groups.google.com/g/tesseract-ocr/c/Wdh_JJwnw94/m/24JHDYQbBQAJ
>
>
> After that, you can get output (but the dot is missing) with the command 
> line: "tesseract pH_treshr.png -"
>
> I was able to get the decimal point separator with the letsgodigital data 
> file 
> https://github.com/arturaugusto/display_ocr/blob/master/letsgodigital/letsgodigital.traineddata
> tesseract pH_treshr.png - -l letsgodigital
>
> Or  have a look at SSD https://github.com/Shreeshrii/tessdata_ssd
>
> Zdenko
>
>
> so 25. 6. 2022 o 12:17 Hervé  napísal(a):
>
>> I am on tesseract 5
>>
>> Inverting images 
>>
>> While tesseract version 3.05 (and older) handle inverted image (dark 
>> background and light text) without problem, for 4.x version use dark text 
>> on light background.
>> isn'it the same than : 
>> (thresh, im_bw) = cv2.threshold(gray, 128, 255, cv2.THRESH_BINARY | 
>> cv2.THRESH_OTSU)
>> im_bw = cv2.bitwise_not(im_bw)
>>
>> for resizing, I take my picture in full HD, do increasing resolution will 
>> allow tesseract to better OCR ?
>>
>> thanks
>>
>>
>> Le samedi 25 juin 2022 à 11:25:50 UTC+2, zdenop a écrit :
>>
>>> Why you did not try more relevant hits like inverting and resizing?
>>>
>>> Zdenko
>>>
>>>
>>> so 25. 6. 2022 o 10:56 Hervé  napísal(a):
>>>
 I tried gray image, black and white, and I use 

  custom_psm = r'--psm 7'

 didn't try others parameters
 Le samedi 25 juin 2022 à 10:32:14 UTC+2, zdenop a écrit :

>
>
> so 25. 6. 2022 o 8:15 Hervé  napísal(a):
>
>> Hi
>> I just tried some, without real success
>>
>> Please be specific: what did you try and what was the result?
>
>  
>
>> could I learn digits from pictures ? maybe this font is not well 
>> recognized
>>
>
> Any training is useless if the failure is at the image preprocessing 
> stage.
>
>
>> thanks
>>
>> Le vendredi 24 juin 2022 à 17:12:44 UTC+2, zdenop a écrit :
>>
>>> Did try to implement suggestion from documentation?
>>> https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md
>>>
>>>
>>> Zdenko
>>>
>>>
>>> pi 24. 6. 2022 o 16:59 Hervé  napísal(a):
>>>
 Hi, I need some help to make tesseract-OCR recognize digits : can't 
 achieve to make this work with

  
 https://img.super-h.fr/images/2022/06/24/9a03414616bc4c6bd6e4bdb78e9d6783.jpg
  

 here is my code : 



 import cv2
 import pytesseract

 pytesseract.pytesseract.tesseract_cmd ="C:\\Program 
 Files\\Tesseract-OCR\\tesseract.exe"

 def process_image(img):
 #cv2.imshow('Img',img)
 #cv2.waitKey(0)

 ### passage en niveau de gris
 gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
 #cv2.imshow('Img',gray)
 #v2.waitKey(0)

 ###analyse de l'image
 valeur = pytesseract.image_to_string(gray)
 print(valeur)

 ##passage en noir et blanc
 (thresh, im_bw) = cv2.threshold(gray, 128, 255, 
 cv2.THRESH_BINARY | cv2.THRESH_OTSU)
 im_bw = cv2.bitwise_not(im_bw)
 #cv2.imshow('Img',im_bw)
 #cv2.waitKey(0)
 # cv2.imwrite('ph.png',im_bw)
 print(pytesseract.image_to_string(im_bw))


 ###ouverture de l'image
 img = cv2.imread('ocr5.png')
 # cv2.imshow('Img',imgcoupee)


 ###on rogne
 imgcoupee = img[1056:1517,950:1862]
 #img = cv2.imwrite('ocrcoupee.png',imgcoupee)
 # cv2.imshow('Img',imgcoupee)

 ### decoupage de la partie correspondant au PH
 ph= img[516:625, 616:815]

 #cv2.imwrite('pH.jpg', image_pH)

 ### partie chlore
 cl = img[516:625, 882:1056]

 ### partie défaut flow
 #flow= img[1302:1398,1054:1400]

 ### process
 #process_image(imgcoupee)
 process_image(ph)
 process_image(cl)
 #process_image(flow)

 digits seems to be clear enough, but it does'nt work, if someone 

Re: [tesseract-ocr] Tesseract OCR LCD digits doesn't work

2022-06-25 Thread Zdenko Podobny
Sorry, I meant rescaling:

Tesseract works best on images which have a DPI of at least 300 dpi, so it
may be beneficial to resize images. For more information see the FAQ.
"Willus Dotkom" ran an interesting test of optimal image resolution, with a
suggestion for the optimal height of capital letters in pixels:
https://groups.google.com/g/tesseract-ocr/c/Wdh_JJwnw94/m/24JHDYQbBQAJ
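
For a camera image the DPI value only has meaning relative to a physical
size, so what matters in practice is how many pixels cover the text. A minimal
sketch of the arithmetic, assuming a full-HD frame (1920 pixels wide) that
covers a scene 20 inches wide; both the scene width and the function names
are hypothetical.

```python
def effective_dpi(pixels: int, inches: float) -> float:
    """Pixels per inch for a given pixel extent and physical extent."""
    return pixels / inches


def scale_to_target_dpi(current_dpi: float, target_dpi: float = 300.0) -> float:
    """Resize factor needed to reach target_dpi from current_dpi."""
    return target_dpi / current_dpi


dpi = effective_dpi(1920, 20)
print(dpi)                       # 96.0
print(scale_to_target_dpi(dpi))  # 3.125
```

Upscaling by that factor does not add detail, but it can bring the letter
height into the range the recognizer handles best.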


After that, you can get output (but the dot is missing) with the command
line: "tesseract pH_treshr.png -"

I was able to get the decimal point separator with the letsgodigital data
file
https://github.com/arturaugusto/display_ocr/blob/master/letsgodigital/letsgodigital.traineddata
tesseract pH_treshr.png - -l letsgodigital
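
For scripting, the same command can be assembled as an argument list. A
minimal sketch, assuming letsgodigital.traineddata has been downloaded into a
tessdata directory that Tesseract can find; the helper name tesseract_argv
and its optional tessdata_dir argument are hypothetical.

```python
import shlex
from typing import List, Optional


def tesseract_argv(image: str, lang: str,
                   tessdata_dir: Optional[str] = None) -> List[str]:
    """Argument list for OCR-ing one image to stdout with a given model."""
    argv = ["tesseract", image, "-", "-l", lang]  # "-" sends the text to stdout
    if tessdata_dir is not None:
        argv += ["--tessdata-dir", tessdata_dir]
    return argv


argv = tesseract_argv("pH_treshr.png", "letsgodigital")
print(shlex.join(argv))  # tesseract pH_treshr.png - -l letsgodigital
```

The list can be handed to subprocess.run(argv, capture_output=True, text=True)
once Tesseract and the data file are installed.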

Or  have a look at SSD https://github.com/Shreeshrii/tessdata_ssd

Zdenko


On Sat, 25 Jun 2022 at 12:17, Hervé wrote:

> I am on tesseract 5
>
> Inverting images
>
> While tesseract version 3.05 (and older) handle inverted image (dark
> background and light text) without problem, for 4.x version use dark text
> on light background.
> isn'it the same than :
> (thresh, im_bw) = cv2.threshold(gray, 128, 255, cv2.THRESH_BINARY |
> cv2.THRESH_OTSU)
> im_bw = cv2.bitwise_not(im_bw)
>
> for resizing, I take my picture in full HD, do increasing resolution will
> allow tesseract to better OCR ?
>
> thanks
>
>
> Le samedi 25 juin 2022 à 11:25:50 UTC+2, zdenop a écrit :
>
>> Why you did not try more relevant hits like inverting and resizing?
>>
>> Zdenko
>>
>>
>> so 25. 6. 2022 o 10:56 Hervé  napísal(a):
>>
>>> I tried gray image, black and white, and I use
>>>
>>>  custom_psm = r'--psm 7'
>>>
>>> didn't try others parameters
>>> Le samedi 25 juin 2022 à 10:32:14 UTC+2, zdenop a écrit :
>>>


 so 25. 6. 2022 o 8:15 Hervé  napísal(a):

> Hi
> I just tried some, without real success
>
> Please be specific: what did you try and what was the result?



> could I learn digits from pictures ? maybe this font is not well
> recognized
>

 Any training is useless if the failure is at the image preprocessing
 stage.


> thanks
>
> Le vendredi 24 juin 2022 à 17:12:44 UTC+2, zdenop a écrit :
>
>> Did try to implement suggestion from documentation?
>> https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md
>>
>>
>> Zdenko
>>
>>
>> pi 24. 6. 2022 o 16:59 Hervé  napísal(a):
>>
>>> Hi, I need some help to make tesseract-OCR recognize digits : can't
>>> achieve to make this work with
>>>
>>>
>>> https://img.super-h.fr/images/2022/06/24/9a03414616bc4c6bd6e4bdb78e9d6783.jpg
>>>
>>> here is my code :
>>>
>>>
>>>
>>> import cv2
>>> import pytesseract
>>>
>>> pytesseract.pytesseract.tesseract_cmd ="C:\\Program
>>> Files\\Tesseract-OCR\\tesseract.exe"
>>>
>>> def process_image(img):
>>> #cv2.imshow('Img',img)
>>> #cv2.waitKey(0)
>>>
>>> ### passage en niveau de gris
>>> gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
>>> #cv2.imshow('Img',gray)
>>> #v2.waitKey(0)
>>>
>>> ###analyse de l'image
>>> valeur = pytesseract.image_to_string(gray)
>>> print(valeur)
>>>
>>> ##passage en noir et blanc
>>> (thresh, im_bw) = cv2.threshold(gray, 128, 255,
>>> cv2.THRESH_BINARY | cv2.THRESH_OTSU)
>>> im_bw = cv2.bitwise_not(im_bw)
>>> #cv2.imshow('Img',im_bw)
>>> #cv2.waitKey(0)
>>> # cv2.imwrite('ph.png',im_bw)
>>> print(pytesseract.image_to_string(im_bw))
>>>
>>>
>>> ###ouverture de l'image
>>> img = cv2.imread('ocr5.png')
>>> # cv2.imshow('Img',imgcoupee)
>>>
>>>
>>> ###on rogne
>>> imgcoupee = img[1056:1517,950:1862]
>>> #img = cv2.imwrite('ocrcoupee.png',imgcoupee)
>>> # cv2.imshow('Img',imgcoupee)
>>>
>>> ### decoupage de la partie correspondant au PH
>>> ph= img[516:625, 616:815]
>>>
>>> #cv2.imwrite('pH.jpg', image_pH)
>>>
>>> ### partie chlore
>>> cl = img[516:625, 882:1056]
>>>
>>> ### partie défaut flow
>>> #flow= img[1302:1398,1054:1400]
>>>
>>> ### process
>>> #process_image(imgcoupee)
>>> process_image(ph)
>>> process_image(cl)
>>> #process_image(flow)
>>>
>>> digits seems to be clear enough, but it does'nt work, if someone
>>> could help me ?
>>>
>>> thanks !
>>>

Re: [tesseract-ocr] Tesseract OCR LCD digits doesn't work

2022-06-25 Thread Hervé
I am on tesseract 5

Inverting images 

While tesseract version 3.05 (and older) handle inverted image (dark 
background and light text) without problem, for 4.x version use dark text 
on light background.
Isn't that the same as:
(thresh, im_bw) = cv2.threshold(gray, 128, 255, cv2.THRESH_BINARY |
cv2.THRESH_OTSU)
im_bw = cv2.bitwise_not(im_bw)

As for resizing, I take my picture in full HD. Would increasing the
resolution allow Tesseract to OCR it better?

thanks


On Saturday, 25 June 2022 at 11:25:50 UTC+2, zdenop wrote:

> Why you did not try more relevant hits like inverting and resizing?
>
> Zdenko
>
>
> so 25. 6. 2022 o 10:56 Hervé  napísal(a):
>
>> I tried gray image, black and white, and I use 
>>
>>  custom_psm = r'--psm 7'
>>
>> didn't try others parameters
>> Le samedi 25 juin 2022 à 10:32:14 UTC+2, zdenop a écrit :
>>
>>>
>>>
>>> so 25. 6. 2022 o 8:15 Hervé  napísal(a):
>>>
 Hi
 I just tried some, without real success

 Please be specific: what did you try and what was the result?
>>>
>>>  
>>>
 could I learn digits from pictures ? maybe this font is not well 
 recognized

>>>
>>> Any training is useless if the failure is at the image preprocessing 
>>> stage.
>>>
>>>
 thanks

 Le vendredi 24 juin 2022 à 17:12:44 UTC+2, zdenop a écrit :

> Did try to implement suggestion from documentation?
> https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md
>
>
> Zdenko
>
>
> pi 24. 6. 2022 o 16:59 Hervé  napísal(a):
>
>> Hi, I need some help to make tesseract-OCR recognize digits : can't 
>> achieve to make this work with
>>
>>  
>> https://img.super-h.fr/images/2022/06/24/9a03414616bc4c6bd6e4bdb78e9d6783.jpg
>>  
>>
>> here is my code : 
>>
>>
>>
>> import cv2
>> import pytesseract
>>
>> pytesseract.pytesseract.tesseract_cmd ="C:\\Program 
>> Files\\Tesseract-OCR\\tesseract.exe"
>>
>> def process_image(img):
>> #cv2.imshow('Img',img)
>> #cv2.waitKey(0)
>>
>> ### passage en niveau de gris
>> gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
>> #cv2.imshow('Img',gray)
>> #v2.waitKey(0)
>>
>> ###analyse de l'image
>> valeur = pytesseract.image_to_string(gray)
>> print(valeur)
>>
>> ##passage en noir et blanc
>> (thresh, im_bw) = cv2.threshold(gray, 128, 255, cv2.THRESH_BINARY 
>> | cv2.THRESH_OTSU)
>> im_bw = cv2.bitwise_not(im_bw)
>> #cv2.imshow('Img',im_bw)
>> #cv2.waitKey(0)
>> # cv2.imwrite('ph.png',im_bw)
>> print(pytesseract.image_to_string(im_bw))
>>
>>
>> ###ouverture de l'image
>> img = cv2.imread('ocr5.png')
>> # cv2.imshow('Img',imgcoupee)
>>
>>
>> ###on rogne
>> imgcoupee = img[1056:1517,950:1862]
>> #img = cv2.imwrite('ocrcoupee.png',imgcoupee)
>> # cv2.imshow('Img',imgcoupee)
>>
>> ### decoupage de la partie correspondant au PH
>> ph= img[516:625, 616:815]
>>
>> #cv2.imwrite('pH.jpg', image_pH)
>>
>> ### partie chlore
>> cl = img[516:625, 882:1056]
>>
>> ### partie défaut flow
>> #flow= img[1302:1398,1054:1400]
>>
>> ### process
>> #process_image(imgcoupee)
>> process_image(ph)
>> process_image(cl)
>> #process_image(flow)
>>
>> digits seems to be clear enough, but it does'nt work, if someone 
>> could help me ?
>>
>> thanks !
>>

Re: [tesseract-ocr] Tesseract OCR LCD digits doesn't work

2022-06-25 Thread Zdenko Podobny
Why didn't you try more relevant hints, like inverting and resizing?

Zdenko


On Sat, 25 Jun 2022 at 10:56, Hervé wrote:

> I tried gray image, black and white, and I use
>
>  custom_psm = r'--psm 7'
>
> didn't try others parameters
> Le samedi 25 juin 2022 à 10:32:14 UTC+2, zdenop a écrit :
>
>>
>>
>> so 25. 6. 2022 o 8:15 Hervé  napísal(a):
>>
>>> Hi
>>> I just tried some, without real success
>>>
>>> Please be specific: what did you try and what was the result?
>>
>>
>>
>>> could I learn digits from pictures ? maybe this font is not well
>>> recognized
>>>
>>
>> Any training is useless if the failure is at the image preprocessing
>> stage.
>>
>>
>>> thanks
>>>
>>> Le vendredi 24 juin 2022 à 17:12:44 UTC+2, zdenop a écrit :
>>>
 Did try to implement suggestion from documentation?
 https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md


 Zdenko


 pi 24. 6. 2022 o 16:59 Hervé  napísal(a):

> Hi, I need some help getting Tesseract OCR to recognize digits. I can't
> get it to work with:
>
>
> https://img.super-h.fr/images/2022/06/24/9a03414616bc4c6bd6e4bdb78e9d6783.jpg
>
> Here is my code:
>
>
>
> import cv2
> import pytesseract
>
> pytesseract.pytesseract.tesseract_cmd = "C:\\Program Files\\Tesseract-OCR\\tesseract.exe"
>
> def process_image(img):
>     #cv2.imshow('Img',img)
>     #cv2.waitKey(0)
>
>     ### convert to grayscale
>     gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
>     #cv2.imshow('Img',gray)
>     #cv2.waitKey(0)
>
>     ### OCR the grayscale image
>     valeur = pytesseract.image_to_string(gray)
>     print(valeur)
>
>     ### convert to black and white
>     (thresh, im_bw) = cv2.threshold(gray, 128, 255,
>                                     cv2.THRESH_BINARY | cv2.THRESH_OTSU)
>     im_bw = cv2.bitwise_not(im_bw)
>     #cv2.imshow('Img',im_bw)
>     #cv2.waitKey(0)
>     # cv2.imwrite('ph.png',im_bw)
>     print(pytesseract.image_to_string(im_bw))
>
>
> ### open the image
> img = cv2.imread('ocr5.png')
> # cv2.imshow('Img',imgcoupee)
>
>
> ### crop
> imgcoupee = img[1056:1517,950:1862]
> #img = cv2.imwrite('ocrcoupee.png',imgcoupee)
> # cv2.imshow('Img',imgcoupee)
>
> ### crop the region corresponding to the pH
> ph= img[516:625, 616:815]
>
> #cv2.imwrite('pH.jpg', image_pH)
>
> ### chlorine region
> cl = img[516:625, 882:1056]
>
> ### "défaut flow" region
> #flow= img[1302:1398,1054:1400]
>
> ### process
> #process_image(imgcoupee)
> process_image(ph)
> process_image(cl)
> #process_image(flow)
>
> The digits seem clear enough, but it doesn't work. Could someone help me?
>
> Thanks!
>
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to tesseract-oc...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/a05712a5-e6ed-411f-a072-e389ea7095efn%40googlegroups.com
> 
> .
>
 --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to tesseract-oc...@googlegroups.com.
>>>
>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/4ed81a73-0a82-426e-a35e-ba52c5ac71f1n%40googlegroups.com
>>> 
>>> .
>>>
>> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/eb2f2bdd-843d-4f11-83bb-d96e578ad94en%40googlegroups.com
> 
> .
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAJbzG8zayyoRVx6J58yR%3DitS1J-STOPCFmJLQ63Xtm278zo5OA%40mail.gmail.com.


Re: [tesseract-ocr] Tesseract OCR LCD digits doesn't work

2022-06-25 Thread Hervé
I tried a grayscale image and a black-and-white one, and I used

 custom_psm = r'--psm 7'

I didn't try any other parameters.
On Saturday, 25 June 2022 at 10:32:14 UTC+2, zdenop wrote:

> On Sat, 25 Jun 2022 at 8:15, Hervé wrote:
>
>> Hi
>> I just tried some of them, without real success.
>
> Please be specific: what did you try and what was the result?
>
>> Could I train on digits from pictures? Maybe this font is not well
>> recognized.
>
> Any training is useless if the failure is at the image preprocessing stage.


Re: [tesseract-ocr] Tesseract OCR LCD digits doesn't work

2022-06-25 Thread Zdenko Podobny
On Sat, 25 Jun 2022 at 8:15, Hervé wrote:

> Hi
> I just tried some of them, without real success.

Please be specific: what did you try and what was the result?

> Could I train on digits from pictures? Maybe this font is not well recognized.

Any training is useless if the failure is at the image preprocessing stage.


> thanks


Re: [tesseract-ocr] Tesseract OCR LCD digits doesn't work

2022-06-24 Thread Hervé
Hi
I just tried some of them, without real success.

Could I train on digits from pictures? Maybe this font is not well recognized.

Thanks

On Friday, 24 June 2022 at 17:12:44 UTC+2, zdenop wrote:

> Did you try to implement the suggestions from the documentation?
> https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md
>
>
> Zdenko


Re: [tesseract-ocr] Tesseract OCR LCD digits doesn't work

2022-06-24 Thread Zdenko Podobny
Did you try to implement the suggestions from the documentation?
https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md


Zdenko




[tesseract-ocr] Tesseract OCR LCD digits doesn't work

2022-06-24 Thread Hervé
Hi, I need some help getting Tesseract OCR to recognize digits. I can't get 
it to work with this image:

 https://img.super-h.fr/images/2022/06/24/9a03414616bc4c6bd6e4bdb78e9d6783.jpg 


Here is my code:



import cv2
import pytesseract

pytesseract.pytesseract.tesseract_cmd = "C:\\Program Files\\Tesseract-OCR\\tesseract.exe"

def process_image(img):
    #cv2.imshow('Img',img)
    #cv2.waitKey(0)

    ### convert to grayscale
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    #cv2.imshow('Img',gray)
    #cv2.waitKey(0)

    ### run OCR on the grayscale image
    valeur = pytesseract.image_to_string(gray)
    print(valeur)

    ## convert to black and white, then invert
    (thresh, im_bw) = cv2.threshold(gray, 128, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    im_bw = cv2.bitwise_not(im_bw)
    #cv2.imshow('Img',im_bw)
    #cv2.waitKey(0)
    # cv2.imwrite('ph.png',im_bw)
    print(pytesseract.image_to_string(im_bw))


### open the image
img = cv2.imread('ocr5.png')
# cv2.imshow('Img',imgcoupee)


### crop
imgcoupee = img[1056:1517,950:1862]
#img = cv2.imwrite('ocrcoupee.png',imgcoupee)
# cv2.imshow('Img',imgcoupee)

### cut out the region showing the pH
ph = img[516:625, 616:815]

#cv2.imwrite('pH.jpg', image_pH)

### chlorine region
cl = img[516:625, 882:1056]

### flow fault ("défaut flow") region
#flow= img[1302:1398,1054:1400]

### process
#process_image(imgcoupee)
process_image(ph)
process_image(cl)
#process_image(flow)

The digits seem to be clear enough, but it doesn't work. Could someone help 
me?

Thanks!



Re: [tesseract-ocr] Tesseract confused between a character and a digit which look-alike

2022-06-24 Thread Lorenzo Bolzani
Hi Yash,
please see the example at the bottom of this page:

https://github.com/sirfz/tesserocr

and this issue about the versions (I think you need version 5.x):

https://github.com/sirfz/tesserocr/issues/166


If you have problems with tesserocr make sure it matches the tesseract
version it was compiled for:


https://github.com/sirfz/tesserocr/releases/tag/v2.5.2


The alternative choices should also be available in the XML output, if I
remember correctly.


Your input image is very tiny (the text is 9 pixels tall) and there are a lot
of compression artifacts. If possible, acquire a higher-resolution image
with less compression.

Also try to MANUALLY clean the text more (with Gimp for example) to remove
the black fragments of the border or the dot on the left to see IF this
gives you better results. Also try to MANUALLY remove almost all of the
white borders.

IF any of these gives you better results you can think about how to improve
your automated pre-processing step with a clear target, like the attached
images (I did not test them).

Your image uses two background colors; you can cut the top and bottom parts
and process each fragment on its own (so adaptive thresholding does not get
confused).
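
The idea of cutting the image where the background changes and binarizing each part separately can be sketched as follows. This is only an illustration: the function name `binarize_bands`, the hand-picked split row and the simple midpoint threshold are assumptions (a per-band Otsu threshold would be a more usual choice), and it was not run on the images from this thread.

```python
import numpy as np


def binarize_bands(gray, split_rows):
    """Binarize each horizontal band of a grayscale image with its own threshold.

    gray: 2-D uint8 array; split_rows: row indices where the background changes
    (each must lie strictly inside the image so that no band is empty).
    """
    out = np.empty_like(gray)
    edges = [0, *split_rows, gray.shape[0]]
    for top, bottom in zip(edges[:-1], edges[1:]):
        band = gray[top:bottom]
        # Midpoint between the darkest and brightest pixel of this band only.
        threshold = (int(band.min()) + int(band.max())) / 2
        out[top:bottom] = np.where(band > threshold, 255, 0)
    return out


# Synthetic example: dark strokes on a light band above a darker band.
page = np.full((8, 6), 200, dtype=np.uint8)
page[4:] = 90            # darker background in the lower half
page[1:3, 2:4] = 40      # "text" in the upper band
page[5:7, 2:4] = 10      # "text" in the lower band
clean = binarize_bands(page, split_rows=[4])
```

With a single global threshold of 128 the whole lower band of `page` would come out black; thresholding per band keeps both backgrounds white and both strokes black.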




Bye,

Lorenzo

On Fri, 24 Jun 2022 at 09:22, 'Yash Mistry' via tesseract-ocr <
tesseract-ocr@googlegroups.com> wrote:

> Hi Lorenzo,
>
> Thank you for the suggestions.
>
> The first approach you suggest is not feasible for me because there is no
> certainty that a specific type of data will be present at a particular
> position.
>
> I am interested in the second approach. I am trying to find any Tesseract
> functionality that gives me all possible predictions for a specific letter,
> but I haven't found a solution yet.
>
> Can you please tell me where you found this kind of functionality
> in Tesseract, and in which version?
>
> Thank you

Re: [tesseract-ocr] Tesseract confused between a character and a digit which look-alike

2022-06-24 Thread 'Yash Mistry' via tesseract-ocr
Hi Lorenzo,

Thank you for the suggestions.

The first approach you suggest is not feasible for me because there is no 
certainty that a specific type of data will be present at a particular 
position.

I am interested in the second approach. I am trying to find any Tesseract 
functionality that gives me all possible predictions for a specific letter, 
but I haven't found a solution yet.

Can you please tell me where you found this kind of functionality 
in Tesseract, and in which version?

Thank you



[tesseract-ocr] tesseract returns random and spurious characters

2022-06-21 Thread Z. Jay
We have been using a competing OCR tool and are now evaluating a switch to 
tesseract. However, when converting a PNG, tesseract randomly, albeit 
rarely, returns characters where there is only white space. For example, 
tesseract will return a comma or an equals sign where there is only white 
space. Scrutinizing the PNG, I do not see anything, such as dirt or a speck, 
that looks like anything other than white space. While this is rare and 
random, it happens often enough to be a problem. Note that this does not occur 
when using our current OCR tool. I suspect someone has encountered this 
issue before and already posted the solution somewhere on this list or 
elsewhere.

For reference, here is a comparison of the actual text and the text 
returned by tesseract:
Actual:
   10/17  10/17,  PAYMENT THANK YOU $64.79CR  

Returned:
   10/17, 10/17,  =PAYMENT THANK YOU $64.79CR  

Any pointers appreciated.

Thanks,

--zj



[tesseract-ocr] Tesseract Model Doc.

2022-06-13 Thread Haresh Parmar
Hello, 

I am interested in learning how the Tesseract LSTM model works. Can anyone 
share a document or link?



[tesseract-ocr] Tesseract Offline Blazor Error.

2022-06-07 Thread Leon Komendant
Hello,
I'm trying to add an OCR function to my Blazor website. For that I have 
a JS file (ocr.js) that creates a worker that should recognize the 
image. The paths are all correct. 
My website uses HTTPS with a self-signed certificate.

With this setup everything works as long as I'm online, but I want the same 
functionality when I don't have an internet connection. So I tried to make 
all of the worker's paths local.

*Code File ocr.js:*

var worker = null;
var ocrobject = "ocrobject";

/* call function work to create text using chosen image */
async function imageToText(Image) {
    worker = Tesseract.createWorker({
        workerPath: 'lib/tesseract.js/worker.min.js',
        langPath: /*'source/tessdata/',*/ 'lib/tesseract.js',
        corePath: 'lib/tesseract.js/tesseract-core.wasm.js',
        /*add logger here*/
        logger: m => console.log(m)
    });
    Tesseract.setLogging(true);
    await work(Image);
}

/* function to create text from given image, using whitelist to specify
   which characters are allowed in the return text */
async function work(img) {
    await worker.load();
    await worker.loadLanguage('deu');
    await worker.initialize('deu');
    await worker.setParameters({
        tessedit_char_whitelist: '0123456789+()&|/\.-:@abcdefghijklmnopqrstuvwxyzäöüßABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÜ ',
    });
    let result = await worker.detect(img);
    console.log(result.data);
    result = await worker.recognize(img);
    let line = result.data.lines.map(e => e.text);
    console.log(line);
    let lineJson = JSON.stringify(line);
    sessionStorage.setItem(ocrobject, lineJson);
    await worker.terminate();
}

/* Source: github.com/naptha/tesseract.js */

Now I get this error when I call the function 
"imageToText" while offline:
Uncaught Error: TypeError: Failed to fetch
at createWorker.js:173:15
at e.onmessage (onMessage.js:3:5)

createWorker.js is called in the file "tesseract.min.js.map". 
Do I need to add more files to my project so that tesseract can work 
offline? Or do I need to change 
"tesseract.min.js"/"tesseract.min.js.map"?
You can see the files I already have in the image "Solution Explorer".

Windows 10
VS19
.NET 5.0

Thanks to everyone who is trying to help.




[tesseract-ocr] Tesseract .uzn zone file

2022-06-07 Thread Simas Skubutis
Environment
   
   - *Tesseract Version*: 5.1.0
   - *Platform*: Windows 10 64bit

*Problem with .uzn file*

With tesseract 4.1.0 everything worked perfectly. I used the 
command tesseract inputPhotoName.png outputName -l eng --oem 1 --psm 4 
hocr, and tesseract automatically picked up the inputPhotoName.uzn zone file 
and returned the words within the coordinates I had specified. After I 
upgraded tesseract from 4.1.0 to 5.1.0, tesseract no longer picks up 
the inputPhotoName.uzn zone file and returns all words found in the image.
After a lot of searching I can't find a way to 
pass the inputPhotoName.uzn file to tesseract 5.1.0. Does tesseract 5.1.0 no 
longer support .uzn files?

*inputPhotoName.uzn file content*: 215 3334 95 19
*Tesseract 4.1.0/5.1.0 command*: tesseract inputPhotoName.png outputName 
-l eng --oem 1 --psm 4 hocr



Re: [tesseract-ocr] Tesseract confused between a character and a digit which look-alike

2022-06-07 Thread Lorenzo Bolzani
Hi Yash,
in my experience you are going to see a lot of these errors on similar
characters.


Given only the pre-processed text, I might make the same mistake myself.


What I do is to fix these letters according to a pattern, in this case
WDDD

and I replace:

S <-> 8
O <-> 0
I  <->  1
i  <->  1
l  <->  1
z  <->  2
Z  <->  2
etc.
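
A minimal sketch of this kind of pattern-based substitution, using the look-alike pairs listed above. The function name `fix_by_pattern` and the pattern notation ('A' for a letter, 'D' for a digit) are assumptions made for the example and are not necessarily what "WDDD" stands for; mapping 1 back to "I" rather than "l" is also an arbitrary choice.

```python
# Look-alike pairs from the list above, split by the direction of the fix.
TO_DIGIT = {"S": "8", "O": "0", "I": "1", "i": "1", "l": "1", "z": "2", "Z": "2"}
TO_LETTER = {"8": "S", "0": "O", "1": "I", "2": "Z"}


def fix_by_pattern(text, pattern):
    """Force each character into the class the pattern expects.

    pattern uses 'A' for a letter and 'D' for a digit (hypothetical notation);
    any other pattern character leaves the text character untouched.
    """
    if len(text) != len(pattern):
        return text  # the pattern does not apply, leave the OCR result alone
    fixed = []
    for char, kind in zip(text, pattern):
        if kind == "D" and not char.isdigit():
            char = TO_DIGIT.get(char, char)
        elif kind == "A" and char.isdigit():
            char = TO_LETTER.get(char, char)
        fixed.append(char)
    return "".join(fixed)


print(fix_by_pattern("81003716", "ADDDDDDD"))  # S1003716
print(fix_by_pattern("S1OO37I6", "ADDDDDDD"))  # S1003716
```

This only works when the expected layout of the field is known in advance; when the length does not match the pattern, the text is returned unchanged.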

Another option, though I'm not 100% sure it is possible with the latest
version, is to ask tesseract for the whole list of predictions for each
token with confidence. For the first token you'd get something like:

S: 0.6839
8: 0.2123
B: 0.1445
...

and, again according to a pattern, you select the best-matching one (you
need to use the choiceIterator on the result object, iterating at level
SYMBOL). This second approach is more elegant, but I do not think it gives
you much more than the simpler approach.
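
The selection step of this second approach could look roughly like the
sketch below. It only covers choosing among candidates; obtaining them is
left out (with tesserocr they would presumably come from GetChoiceIterator()
on a result iterator at RIL.SYMBOL). As assumptions, "W" means a letter and
"D" a digit, and pick_by_pattern is a hypothetical helper name. The sample
confidences are the ones quoted above.

```python
def pick_by_pattern(choices_per_symbol, pattern):
    """Pick, per symbol, the most confident candidate allowed by the pattern.

    choices_per_symbol holds one list of (character, confidence) pairs per
    symbol. Each list is assumed to be non-empty.
    """
    picked = []
    for choices, kind in zip(choices_per_symbol, pattern):
        if kind == "D":
            allowed = [c for c in choices if c[0].isdigit()]
        elif kind == "W":
            allowed = [c for c in choices if c[0].isalpha()]
        else:
            allowed = list(choices)
        # Fall back to the raw candidates when none fits the pattern.
        pool = allowed or choices
        best = max(pool, key=lambda c: c[1])
        picked.append(best[0])
    return "".join(picked)


first_symbol = [("S", 0.6839), ("8", 0.2123), ("B", 0.1445)]
print(pick_by_pattern([first_symbol], "W"))  # S
print(pick_by_pattern([first_symbol], "D"))  # 8
```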

Of course a little bit of model fine tuning helps but will not fix these
problems 100% and it takes a lot of time to do it properly.


I recommend using tesserocr, which is a real API/library wrapper (not a
command line wrapper...); it gives you access to the whole API and, if used
properly, it is a lot faster.



Bye

Lorenzo

On Tue, 7 Jun 2022 at 09:50, 'Yash Mistry' via tesseract-ocr <
tesseract-ocr@googlegroups.com> wrote:

> I am having trouble extracting the correct character from a word when
> characters look alike, e.g. 5 & S, I & 1, 8 & S.
>
> I applied image pre-processing (blurring, erosion, dilation, noise
> normalisation, removal of unnecessary components) and text detection to
> the input image, but even after this much pre-processing Tesseract OCR
> isn't giving the correct result.
>
> Please check the attached images.
>
> *Original Image*
>
>
> *[image: image.png]*
>
> *Pre-processed Image*
>
> [image: image (1).png]
>
> *Detected Text*
>
>
> *[image: image (2).png]*
>
>
> *[image: image (3).png]*
>
> *Tesseract Configuration*
>
> -l eng --oem 1 --psm 7 -c
> tessedit_char_whitelist="ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n"
> load_system_dawg=false load_freq_dawg=false
>
> *Result of OCR*: TITLENUMBER 81003716
>
> As we can see, OCR reads S as 8 even after pre-processing and text
> detection.
>
> Is there any way we can overcome this problem?
>
> *Tesseract Version*: tesseract 5.1.0-32-gf36c0
>
> Note: I asked the same question in the pytesseract GitHub repo and was
> advised to post it here.
>

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAMgOLLxhLY1FXQZAR%2Be5Cc%2Bm0p6j%3DZBaUOMz9-Bef0%3DLirW05Q%40mail.gmail.com.


[tesseract-ocr] Tesseract confused between a character and a digit which look-alike

2022-06-07 Thread 'Yash Mistry' via tesseract-ocr


I am having trouble extracting the correct character from a word when
characters look alike, e.g. 5 & S, I & 1, 8 & S.

I applied image pre-processing (blurring, erosion, dilation, noise
normalisation, removal of unnecessary components) and text detection to
the input image, but even after this much pre-processing Tesseract OCR
isn't giving the correct result.

Please check the attached images.

*Original Image*


*[image: image.png]*

*Pre-processed Image*

[image: image (1).png]

*Detected Text*


*[image: image (2).png]*


*[image: image (3).png]*

*Tesseract Configuration*

-l eng --oem 1 --psm 7 -c 
tessedit_char_whitelist="ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n" 
load_system_dawg=false load_freq_dawg=false

*Result of OCR*: TITLENUMBER 81003716

As we can see, OCR reads S as 8 even after pre-processing and text
detection.

Is there any way we can overcome this problem?

*Tesseract Version*: tesseract 5.1.0-32-gf36c0

Note: I asked the same question in the pytesseract GitHub repo and was
advised to post it here.

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/72dac625-d07f-4240-9032-3fa856868b8dn%40googlegroups.com.


[tesseract-ocr] Tesseract v5 architecture

2022-05-30 Thread Giridharan Kumaravelu
I am trying to understand the architecture of the OCR pipeline in Tesseract
v5.0.1, in particular *the preprocessing that happens before the LSTM network
during inference and training*.

I could only find these 7-year-old documentation notes (
https://github.com/tesseract-ocr/docs/tree/main/das_tutorial2016) and I am
not sure if they are still accurate.

   1. Is the information I am looking for present anywhere in the online
   documentation (https://tesseract-ocr.github.io/tessdoc/)?
   2. Is there a way to turn off the page layout analysis and other
   preprocessing before the LSTM modules?


-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/3f329911-5d88-4ca5-9089-f66b78798been%40googlegroups.com.


[tesseract-ocr] Tesseract unable to recognise Ubuntu and Inter fonts, it returned - 1809_Homer

2022-05-20 Thread Kehinde Adeoya
I have successfully trained new fonts: Ubuntu and Inter. I am using
Tesseract 3.0.5 and Tessdata-3.0.4.

1. Tesseract does not recognize them and keeps returning a strange font
name: it returns 1809_Homer for both Ubuntu and Inter. This makes me wonder
whether something is wrong with the training.
2. Tesseract does not seem to differentiate between font-weight: 700 and
font-weight: bold. These are the same, but Tesseract treats font-weight: 700
as a normal font. What can I do to remedy this?

This is how I trained the new tessdata
PANGOCAIRO_BACKEND=fc sh tesstrain.sh --fontlist "Ubuntu" "Ubuntu Bold" 
"Ubuntu Bold Italic" "Ubuntu Italic" "Ubuntu Light" "Ubuntu Light Italic" 
"Ubuntu Medium" "Ubuntu Medium Italic" "Inter" "Inter Bold" "Inter Heavy" 
"Inter Light" "Inter Medium" "Inter Semi-Bold" "Inter Ultra-Bold" "Inter 
weight=250" --fonts_dir /Library/Fonts --lang nld --langdata_dir 
/tessapp/langdata --output_dir /fonts/samples --training_text 
/tessapp/langdata/nld/nld.training_text --tessdata_dir 
/tessapp/tesseract-3.05.02/tessdata --langdata_dir /tessapp/langdata

I got this as the output
nld.traineddata

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/306ed183-d2f1-439d-923b-3af3e4ca89d5n%40googlegroups.com.


[tesseract-ocr] Tesseract not detecting Ubuntu and Inter Google fonts but returning the wrong font - 1809_Homer

2022-05-20 Thread Kehinde Adeoya
I have successfully trained new fonts: Ubuntu and Inter. I am using
Tesseract 3.0.5 and Tessdata-3.0.4.
1. Tesseract does not recognize them and keeps returning a strange font
name: it returns 1809_Homer for Ubuntu. This makes me wonder whether
something is wrong with the training.
2. Tesseract does not seem to differentiate between font-weight: 700 and
font-weight: bold. These are the same, but Tesseract treats font-weight: 700
as a normal font. What can I do to remedy this?

This is how I trained the new tessdata
PANGOCAIRO_BACKEND=fc sh tesstrain.sh --fontlist "Ubuntu" "Ubuntu Bold" 
"Ubuntu Bold Italic" "Ubuntu Italic" "Ubuntu Light" "Ubuntu Light Italic" 
"Ubuntu Medium" "Ubuntu Medium Italic" "Inter" "Inter Bold" "Inter Heavy" 
"Inter Light" "Inter Medium" "Inter Semi-Bold" "Inter Ultra-Bold" "Inter 
weight=250" --fonts_dir /Library/Fonts --lang nld --langdata_dir 
/tessapp/langdata --output_dir /fonts/samples --training_text 
/tessapp/langdata/nld/nld.training_text --tessdata_dir 
/tessapp/tesseract-3.05.02/tessdata --langdata_dir /tessapp/langdata

I got this as the output
nld.traineddata

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/0302b35b-72a0-4c8b-9ffc-5d109bb2d85en%40googlegroups.com.


[tesseract-ocr] Tesseract not recognising trained fonts

2022-05-20 Thread Kehinde Adeoya
I have successfully trained new fonts: Ubuntu and Inter. I am using
Tesseract 3.0.5 and Tessdata-3.0.4.
1. Tesseract does not recognize them and keeps returning a strange font
name: it returns 1809_Homer for Ubuntu. This makes me wonder whether
something is wrong with the training.
2. Tesseract does not seem to differentiate between font-weight: 700 and
font-weight: bold. These are the same, but Tesseract treats font-weight: 700
as a normal font. What can I do to remedy this?

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/99ce17af-291c-4039-8bd1-6ca69b87c47bn%40googlegroups.com.


[tesseract-ocr] Tesseract not recognising trained font

2022-05-20 Thread Kehinde Adeoya
I have successfully trained new fonts: Ubuntu and Inter.

1. Tesseract does not recognize them and keeps returning a strange font
name: it returns 1809_Homer for Ubuntu. This makes me wonder whether
something is wrong with the training.
2. Tesseract does not seem to differentiate between font-weight: 700 and
font-weight: bold. These are the same, but Tesseract treats font-weight: 700
as a normal font. What can I do to remedy this?

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/f4cb33b5-188e-46e7-8d1f-34487f2f97dan%40googlegroups.com.


[tesseract-ocr] tesseract giving wrong output

2022-05-10 Thread boyapally srikanth
Hi all,
Good evening. I have been working on a project in which I need to extract
text from a live video stream. The text I need is written as TM10-50%L,
but Tesseract OCR returns it like this:
[] ™0-50%L [ ].
Below is the sample image.

I need the output to be
TM10-50%L
I preprocessed the image as follows:
1. remove noise
2. adaptive threshold
3. dilate
4. send it to the Tesseract OCR engine

Can anyone suggest how I can get the exact output?
Thanks in advance

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/5a6548f5-975a-4064-ae74-c83de7279441n%40googlegroups.com.


[tesseract-ocr] Tesseract 5

2022-05-04 Thread MYRO STEL ANO
Hi, how can I merge two or more trained files without using -l lang+lang1?
Is there a way to do it? Thanks in advance.

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/211b2c35-bbc7-4847-b7fa-179fa2d5c384n%40googlegroups.com.


[tesseract-ocr] Tesseract choices

2022-04-23 Thread Theis Borg
I'm very new to tesseract so

Is it possible to give tesseract an image and some options to choose from 
and get back a confidence level for each choice?

I.e. input:  "XYZ" "ABC"
output: "XYZ" 0.64, "ABC" 0.11

Any pointers to documentation or tutorials would be appreciated.

q:o)   Theis

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/cae2d4f6-4f62-43ea-8aa4-57336668ad3en%40googlegroups.com.


[tesseract-ocr] Tesseract Fine tuning

2022-04-20 Thread Thura Aung
I have been fine-tuning Tesseract on photos of my own handwriting (12
images).
It has taken a long time and is still processing.
It keeps saying "can't encode".
Is it normal for the model to take that long?
I followed these steps:
https://github.com/tesseract-ocr/tesstrain

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAOS7tfKUqEJSZ2TRJVtix%2BMKbvXpJQxTVhP7Mi65tPepAvbH2g%40mail.gmail.com.


Re: [tesseract-ocr] Tesseract incorrectly recognizing the "S" as an "A" here, not sure why

2022-04-13 Thread Alfredo Jr. Go
Light multicolored text on dark background. Make it black and white and 
then invert the colors. 
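
A minimal sketch of that suggestion in plain Python, working on rows of
(R, G, B) tuples so that it needs no imaging library. The threshold of 128
and the function name to_black_on_white are assumptions chosen for
illustration; in practice an imaging library such as Pillow or OpenCV would
do the same job on a real image.

```python
def to_black_on_white(pixels, threshold=128):
    """Binarize light-on-dark RGB pixels, then invert to dark-on-light.

    pixels is a list of rows, each row a list of (r, g, b) tuples.
    Returns rows of 0 (black) or 255 (white).
    """
    result = []
    for row in pixels:
        new_row = []
        for r, g, b in row:
            # Luminance with the common ITU-R BT.601 weights.
            gray = 0.299 * r + 0.587 * g + 0.114 * b
            # Black and white: light text becomes 255, dark background 0.
            bw = 255 if gray >= threshold else 0
            # Invert: text becomes black (0), background white (255).
            new_row.append(255 - bw)
        result.append(new_row)
    return result


image = [
    [(20, 20, 30), (255, 255, 0)],   # dark background, yellow text
    [(0, 200, 255), (10, 10, 10)],   # light blue text, dark background
]
print(to_black_on_white(image))  # [[255, 0], [0, 255]]
```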

On Tuesday, April 12, 2022 at 2:52:57 AM UTC+8 tylerale...@gmail.com wrote:

> This doesn't help.
>
> I did read that, but I'm not sure what is wrong with my image; it's an
> adequate size and it isn't noisy.
>
> On Monday, April 11, 2022 at 11:30:06 AM UTC-7 zdenop wrote:
>
>> Sure. Follow 
>> https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md
>> and then:
>>
>> tesseract fsm_designer_preprocessed.png -
>> FSM Designer
>>
>>
>> Zdenko
>>
>>
>> On Mon, 11 Apr 2022 at 20:23, tyridge77 wrote:
>>
>>>
>>> [image: NameText.png]
>>> This is the source image.  696 x 112 pixels
>>>
>>> It keeps getting 
>>>
>>> "FaM Designer"
>>>
>>> Any advice? 
>>>
>>

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/6abb1095-7f22-4a6f-8f5e-99b14687949cn%40googlegroups.com.


Re: [tesseract-ocr] Tesseract incorrectly recognizing the "S" as an "A" here, not sure why

2022-04-11 Thread tyridge77
This doesn't help.

I did read that, but I'm not sure what is wrong with my image; it's an
adequate size and it isn't noisy.

On Monday, April 11, 2022 at 11:30:06 AM UTC-7 zdenop wrote:

> Sure. Follow 
> https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md
> and then:
>
> tesseract fsm_designer_preprocessed.png -
> FSM Designer
>
>
> Zdenko
>
>
> On Mon, 11 Apr 2022 at 20:23, tyridge77 wrote:
>
>>
>> [image: NameText.png]
>> This is the source image.  696 x 112 pixels
>>
>> It keeps getting 
>>
>> "FaM Designer"
>>
>> Any advice? 
>>
>

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/634a54d8-7c75-43be-b838-9fc66154b649n%40googlegroups.com.


Re: [tesseract-ocr] Tesseract incorrectly recognizing the "S" as an "A" here, not sure why

2022-04-11 Thread Zdenko Podobny
Sure. Follow
https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md
and then:

tesseract fsm_designer_preprocessed.png -
FSM Designer


Zdenko


On Mon, 11 Apr 2022 at 20:23, tyridge77 wrote:

>
> [image: NameText.png]
> This is the source image.  696 x 112 pixels
>
> It keeps getting
>
> "FaM Designer"
>
> Any advice?
>

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAJbzG8xpO7xrKUAhTzMeeyqT3a88XZpvpf1qJcrSCqEw0sthaw%40mail.gmail.com.

