Re: [tesseract-ocr] accuracy problem after trained in fine-tune

2023-09-10 Thread Des Bw
Hi mhalidu, the script you posted here seems much more extensive than the one you posted before: https://groups.google.com/d/msgid/tesseract-ocr/0e2880d9-64c0-4659-b497-902a5747caf4n%40googlegroups.com . I have been using your earlier script. It is magical. How is this one different from the earlier

[tesseract-ocr] Re: How to start from scratch (new language) in Tesseract 5

2023-09-10 Thread Des Bw
I was having a bit of trouble with the directory locations: seems that TESSDATA_PREFIX worked better. *TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=oro TESSDATA=../tesseract/tessdata MAX_ITERATIONS=1* On Sunday, September 10, 2023 at 8:19:15 PM UTC+3 Des Bw wrote: >

[tesseract-ocr] How to start from scratch (new language) in Tesseract 5

2023-09-10 Thread Des Bw
I am trying to train a new language. I have prepared all the necessary files as per the manual. I have also combined them into a traineddata file using the *combine_lang_model* command. - I also have my training files such as the text files, box files and .lstmf files inside

Re: [tesseract-ocr] accuracy problem after trained in fine-tune

2023-09-11 Thread Des Bw
Thank you so much for putting out these brilliant scripts. They make the process much more efficient. I have one more question on the other script that you use to train.
    import subprocess
    # List of font names
    font_names = ['ben']
    for font in font_names:
        command =

[tesseract-ocr] How to get the net_spec

2023-09-15 Thread Des Bw
For the last couple of days, I have been trying to train the amh data to include some missing characters. I have seen that Shree was able to add the Norwegian Æ by removing the top layer and training on it (https://groups.google.com/g/tesseract-ocr/c/l33zsTEPj70/m/wPzPv6HiEQAJ). I was

Re: [tesseract-ocr] Re: Tesseract v3.03 and norwegian language

2023-09-15 Thread Des Bw
I have exactly the same problem for Amharic. I find three characters missing, and they are ruining the OCR result. Dear Shree, can you help me please? On Friday, January 6, 2017 at 3:50:38 PM UTC+3 shree wrote: > I have uploaded modified nor.traineddata at > >

[tesseract-ocr] Re: Tesseract Custom Model Not Recognized after Training

2023-09-17 Thread Des Bw
One possibility is that you used the fast model as the starter model. You need to continue or start from the best model. On Sunday, September 17, 2023 at 8:11:36 PM UTC+3 mdalihu...@gmail.com wrote: > You can try in VietOCR once and check the traineddata right now is not > corrupted. if works in

Re: [tesseract-ocr] How to get the net_spec

2023-09-17 Thread Des Bw
str:[1,36,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1], >>> flags=40, iteration=286700, sample_iteration=286724, null_char=95, >>> learning_rate=0.001, momentum=0.5, adam_beta=0.999 >>> >>> amh >>> Version string:4.00.00alpha:amh >>> LSTM training info:Network

Re: [tesseract-ocr] accuracy problem after trained in fine-tune

2023-09-12 Thread Des Bw
Yes, I am familiar with the video and have set up the folder structure as you did. Indeed, I have tried a number of fine-tuning runs with a single font following Gracia's video. But your script is much better because it supports multiple fonts. The whole improvement you made is brilliant; and very

[tesseract-ocr] Re: How to start from scratch (new language) in Tesseract 5

2023-09-12 Thread Des Bw
at >> TESSDATA_PREFIX worked better. >> >> *TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=oro >> TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000* >> On Sunday, September 10, 2023 at 8:19:15 PM UTC+3 Des Bw wrote: >> >>> I am trying to train a new lan

Re: [tesseract-ocr] accuracy problem after trained in fine-tune

2023-09-13 Thread Des Bw
How is your training going for Bengali? I have been trying to train from scratch. I made about 64,000 lines of text (which produced about 255,000 files, in the end) and ran the training for 150,000 iterations, getting a 0.51 training error rate. I was hoping to get reasonable accuracy.

Re: [tesseract-ocr] accuracy problem after trained in fine-tune

2023-09-13 Thread Des Bw
oblem. > > It may take 24 hours or more depending on the hardware, dataset, etc. > > The training process should save intermediate models so you should be able > to stop it and resume it later from the last saved model. > > > Lorenzo > > On Wed, 13 Sep 2023

Re: [tesseract-ocr] accuracy problem after trained in fine-tune

2023-09-13 Thread Des Bw
The characters are getting missed, even after fine-tuning. I never made any progress. I tried many different ways. Some specific characters are always missing from the OCR result. On Wednesday, September 13, 2023 at 12:49:20 PM UTC+3 mdalihu...@gmail.com wrote: > EasyOCR I think is best

Re: [tesseract-ocr] accuracy problem after trained in fine-tune

2023-09-13 Thread Des Bw
The problem with regex is that Tesseract is not consistent in its replacement. Suppose the original English training data doesn't contain the letter /u/. What does Tesseract do when it faces /u/ in actual processing? In some cases, it replaces it with closely similar letters such as /v/ and

Re: [tesseract-ocr] accuracy problem after trained in fine-tune

2023-09-13 Thread Des Bw
At what stage are you doing the regex replacement? My process has been: Scan (tif) --> ScanTailor --> Tesseract --> pdf > EasyOCR I think is best for ID cards or something like that image process. but document images like books, here Tesseract is better than EasyOCR. How about PaddleOCR? Are you
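
For readers following the regex discussion, here is a minimal Python sketch of a replacement pass applied to the text that Tesseract returns (that is, after recognition and before writing the final output). The substitution pairs are hypothetical placeholders, not the characters discussed in this thread, and the approach only helps where the misrecognition is consistent.

```python
import re

# Hypothetical substitution table: wrong OCR output -> intended text.
# These pairs are placeholders for illustration only.
REPLACEMENTS = {
    "rn": "m",
    "vv": "w",
}

# Build one alternation, longest keys first, so overlapping keys
# are matched in a predictable order.
_PATTERN = re.compile(
    "|".join(re.escape(k) for k in sorted(REPLACEMENTS, key=len, reverse=True))
)


def fix_ocr_text(text: str) -> str:
    """Replace every known wrong sequence with its intended form."""
    return _PATTERN.sub(lambda m: REPLACEMENTS[m.group(0)], text)
```

A blanket rule like this will also rewrite correct occurrences of the same sequence, so in practice the table would need to be restricted to sequences that never occur legitimately in the language.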

Re: [tesseract-ocr] accuracy problem after trained in fine-tune

2023-09-13 Thread Des Bw
Yes, we are new to this. I find the instructions (the manual) very hard to follow. The video you linked above was really helpful to get started. My plan at the beginning was to fine tune the existing .traineddata. But, I failed in every possible way to introduce a few new characters into the

Re: [tesseract-ocr] accuracy problem after trained in fine-tune

2023-09-13 Thread Des Bw
I now get to 20 iterations; and the error rate is stuck at 0.46. The result is absolutely trash: nowhere close to the default/Ray's training. On Wednesday, September 13, 2023 at 2:47:05 PM UTC+3 mdalihu...@gmail.com wrote: > > after Tesseract recognizes text from images. then you can apply

Re: [tesseract-ocr] Question reg. Telugu ; char missing in ocr ; how to fix ?

2023-09-08 Thread Des Bw
I am in the same boat. I am using the latest version of Tesseract (5.3) on the Mac. The guide mentions a way to add (fine tune) missing characters. But it is very difficult to follow and has many steps; I couldn't wrap my head around it and gave up after a couple of attempts. How

Re: [tesseract-ocr] Normalization failed for string

2023-09-14 Thread Des Bw
The absence of a character in the unicharset is not supposed to cause an error. You have to cross-check that it is encoded in UTF-8. On Thursday, September 14, 2023 at 3:49:05 PM UTC+3 mdalihu...@gmail.com wrote: > We create ground-truth files that are created by every language including > these

Re: [tesseract-ocr] accuracy problem after trained in fine-tune

2023-09-15 Thread Des Bw
Just saw this paper: https://osf.io/b8h7q On Thursday, September 14, 2023 at 9:02:22 PM UTC+3 mdalihu...@gmail.com wrote: > I will try some changes. thx > > On Thursday, 14 September, 2023 at 2:46:36 pm UTC+6 elvi...@gmail.com > wrote: > >> I also faced that issue in the Windows. Apparently,

[tesseract-ocr] Re: trainning question

2023-09-04 Thread Des Bw
Thank you man. This is very useful. On Tuesday, July 25, 2023 at 12:01:20 PM UTC+3 mdalihu...@gmail.com wrote: > make sure the command of the training file will be under tesstrain folder > and run the first command for training data and if you train from any > checkpoint then run the second

Re: [tesseract-ocr] Unable to generate Hindi line images using text2image

2023-09-08 Thread Des Bw
Have you installed all the requirements given in the repo? It is working for me. On Tuesday, June 20, 2023 at 9:51:08 PM UTC+3 zdenop wrote: > Please follow the official training procedure [1], read the official > docs[2], or complain to the author of the tutorial you decide to follow. > > [1]

[tesseract-ocr] Re: General guidelines for training Arabic

2023-09-08 Thread Des Bw
I am also starting up with Tesseract, and not an expert by any means. But, from what I learned from reading in various places: it might be good for you to increase the number of lines to get better results. The iterations are sufficient for the first round. You can increase them step by step. On

[tesseract-ocr] Re: Armenian.traineddata hye language tesseract

2023-10-15 Thread Des Bw
Check the conversation in this forum where Shree trained the Norwegian data to include the missing letter Æ. I used this method to train for Amharic, and it worked for me. Basically, the method is to cut off the top layer of the network and train from there. Fine tuning doesn't work for adding

Re: [tesseract-ocr] Getting Error: No such file or directory: 'data/foo/all-lstmf'

2023-10-27 Thread Des Bw
Do you have a ground-truth? On Friday, October 27, 2023 at 6:32:38 PM UTC+3 develop...@gmail.com wrote: > > I just tried to run these all commands, but I got error > https://prnt.sc/lLHeR27J2U65 > > On Tuesday, June 6, 2023 at 10:03:17 AM UTC+2 zdenop wrote: > >> Do not create files manually.

Re: [tesseract-ocr] Getting Error: No such file or directory: 'data/foo/all-lstmf'

2023-10-27 Thread Des Bw
Do you have the right folder structure? It looks like you don't have a ground truth as well. I presume you are just starting up. You have to check this video first, and download his repo (listed in the comment of the video) to set up the folder

[tesseract-ocr] Re: Arabic characters and numbers

2023-11-01 Thread Des Bw
Doesn't the official Arabic model include the numerals? The Arabic numerals are supposed to be part of almost all the models. The Amharic model I am working on, for example, does recognize Arabic numerals (of course, along with the regular letter characters). -- You received this message

Re: [tesseract-ocr] LSTM-based training produces .box files with the same coordinates

2023-11-01 Thread Des Bw
is the way to do it. On Wednesday, November 1, 2023 at 3:02:28 PM UTC+3 Des Bw wrote: > > I don't know what you are trying to do. I am not familiar with this method > of box generation. But, I think the command you are running is supposed to > generate them with the same coordinate

Re: [tesseract-ocr] How to generate training images with noise

2023-11-01 Thread Des Bw
I am not sure if you are supposed to use those box files for training purposes. All the guides and manuals I have read use either the text2image tool or the manual method (which is presumably outdated). On Wednesday, October 18, 2023 at 6:27:58 PM UTC+3 Keith Smith wrote: > I tried

Re: [tesseract-ocr] OCR

2023-11-01 Thread Des Bw
You need to try to process the images first. I recommend trying ScanTailor. You can then import the processed images to Tesseract. The accuracy will improve. Are you using the official English model to OCR them? On Wednesday, November 1, 2023 at 2:18:54 PM UTC+3 zdenop wrote: > Read the

Re: [tesseract-ocr] LSTM-based training produces .box files with the same coordinates

2023-11-01 Thread Des Bw
To clarify, Shree's script is useful in case your images are not single line. If they are all single line, that script won't do much for you. On Wednesday, November 1, 2023 at 4:20:09 PM UTC+3 Des Bw wrote: > > *1. using synthetic data:* > What can you do if you do not ha

Re: [tesseract-ocr] LSTM-based training produces .box files with the same coordinates

2023-11-01 Thread Des Bw
>> was used for the legacy model. For the current model, text2image is the way >> to do it. >> >> On Wednesday, November 1, 2023 at 3:02:28 PM UTC+3 Des Bw wrote: >> >>> >>> I don't know what you are trying to do. I am not familiar with this

Re: [tesseract-ocr] LSTM-based training produces .box files with the same coordinates

2023-11-01 Thread Des Bw
I don't know what you are trying to do. I am not familiar with this method of box generation. But, I think the command you are running is supposed to generate them with the same coordinates. Look at the example here: https://tesseract-ocr.github.io/tessdoc/tess4/Make-Box-Files.html On

[tesseract-ocr] Re: Help me to generate trainned data file

2023-11-02 Thread Des Bw
That is long stuff... nobody is going to read through that and help you. You need to narrow down your problem to a short, clear issue so that somebody can help you. Are you trying to train Tesseract using those 1000 images? If you are going to do that, you are less likely to succeed because 1000 images is not

Re: [tesseract-ocr] Hi there! I am looking for the open source solution to convert an image (having math equation written on it) into Latex

2023-11-02 Thread Des Bw
OCRing handwritten equations is very difficult. Even Acrobat DC is not recognizing any of the characters in your image. On Thursday, November 2, 2023 at 6:11:51 PM UTC+3 elvi...@gmail.com wrote: > This is regular ocring project. If you have accurate output with whatever > way, latex would

Re: [tesseract-ocr] Re: Dot-matrix woes

2023-11-05 Thread Des Bw
Dear piggy, can you elaborate what you did with the images please? The tools you used; and the modifications you did. I was trying to replicate what you did. But, I am not getting what you get. Is scaling up the image the same thing as increasing the DPI of the image? On Friday, November 3,

Re: Re: [tesseract-ocr] Lessons, best practices, recommendations, strategies, hacks

2023-10-29 Thread Des Bw
for a higher number of iterations in one round. On Sunday, October 29, 2023 at 3:18:32 PM UTC+3 Des Bw wrote: > > *BCER is a lie:* > > (B)CER is an unrealistic measure of accuracy. *It is a lie.* I have said it > a couple of times already. The BCER we get during the training is n

Re: Re: [tesseract-ocr] Lessons, best practices, recommendations, strategies, hacks

2023-10-29 Thread Des Bw
when the model thinks it achieved 0% error. Maybe we can assign a negative value to it. I am going to try that and see if it turns it off. On Tuesday, October 24, 2023 at 3:55:54 PM UTC+3 Des Bw wrote: > You can add *training >> data/lang.log &* to the end of your training > script (she

Re: [tesseract-ocr] Poor results of Tesseract performing a play card evaluation

2023-10-30 Thread Des Bw
How about processing the images using ScanTailor or some other tool before feeding them to Tesseract? On Monday, October 30, 2023 at 4:58:56 AM UTC+3 Art Rhyno wrote: > Maybe use a different segmentation mode? Try changing the line: > > > > text = pytesseract.image_to_string(cropped_image,

Re: Re: [tesseract-ocr] Lessons, best practices, recommendations, strategies, hacks

2023-10-31 Thread Des Bw
, 2023 at 5:46:10 PM UTC+3 Des Bw wrote: > Another lesson I learned today: starting from a smaller number of > iterations and slowly increasing it is bad. We all should train from epochs. > Every interruption is causing tesseract to re-start from the beginning. > Basically, the data that

Re: Re: [tesseract-ocr] Lessons, best practices, recommendations, strategies, hacks

2023-10-31 Thread Des Bw
, 2023 at 5:46:10 PM UTC+3 Des Bw wrote: > Another lesson I learned today: starting from a smaller number of > iterations and slowly increasing it is bad. We all should train from epochs. > Every interruption is causing tesseract to re-start from the beginning. > Basical

Re: Re: [tesseract-ocr] Lessons, best practices, recommendations, strategies, hacks

2023-10-31 Thread Des Bw
Today's lesson: it is possible to *disable TARGET_ERROR_RATE.* If you find your training stopping prematurely because it is hitting the target error rate, you can disable it and train by epochs (iterations) only. Any negative value (such as *TARGET_ERROR_RATE=-1*) will disable the
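
As an illustration of the tip above, here is a minimal Python sketch that assembles a `make training` call with the variables mentioned in these threads (MODEL_NAME, TESSDATA, MAX_ITERATIONS, START_MODEL, TARGET_ERROR_RATE). It assumes it is run from a tesstrain checkout whose Makefile accepts these variables; the helper names `build_training_command` and `run_training` are made up for this example.

```python
import os
import subprocess


def build_training_command(model_name, tessdata, max_iterations,
                           start_model=None, target_error_rate=None):
    """Return the argument list for `make training` (hypothetical helper)."""
    cmd = [
        "make", "training",
        f"MODEL_NAME={model_name}",
        f"TESSDATA={tessdata}",
        f"MAX_ITERATIONS={max_iterations}",
    ]
    if start_model is not None:
        cmd.append(f"START_MODEL={start_model}")
    if target_error_rate is not None:
        # Per the post above, a negative value disables the target error rate.
        cmd.append(f"TARGET_ERROR_RATE={target_error_rate}")
    return cmd


def run_training(cmd, tessdata, cwd="."):
    """Run the command with TESSDATA_PREFIX set, as in the shell examples."""
    env = dict(os.environ, TESSDATA_PREFIX=tessdata)
    return subprocess.run(cmd, cwd=cwd, env=env, check=True)
```

For example, `build_training_command("oro", "../tesseract/tessdata", 10000, target_error_rate=-1)` yields the same variables as the shell command quoted earlier in this archive, plus the disabled target error rate.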

Re: [tesseract-ocr] How to get the net_spec

2023-10-26 Thread Des Bw
Thank you for adding those improvements dear Tom. On Thursday, October 26, 2023 at 5:35:55 PM UTC+3 tfmo...@gmail.com wrote: > On Saturday, September 16, 2023 at 4:25:51 PM UTC-4 shree wrote: > > combine_tessdata(1) >

Re: Re: [tesseract-ocr] Lessons, best practices, recommendations, strategies, hacks

2023-10-30 Thread Des Bw
for training. Look at what Stefan said here: https://github.com/tesseract-ocr/tesseract/issues/3954 On Sunday, October 29, 2023 at 3:18:32 PM UTC+3 Des Bw wrote: > > *BCER is a lie:* > > (B)CER is an unrealistic measure of accuracy. *It is a lie.* I have said it > a couple of times al

[tesseract-ocr] Re: trainning question

2023-10-30 Thread Des Bw
have similar python script that could stop and resume the training. On Monday, September 4, 2023 at 2:34:04 PM UTC+3 Des Bw wrote: > Thank you man. This is very useful. > > On Tuesday, July 25, 2023 at 12:01:20 PM UTC+3 mdalihu...@gmail.com wrote: > >> make sure the command of

[tesseract-ocr] Re: Null box at index 0

2023-09-19 Thread Des Bw
I also found out that the text2image tool creates null tif images as well, resulting in "Compute CTC targets failed!" error On Monday, September 18, 2023 at 7:08:30 PM UTC+3 tesseract-ocr wrote: > I am having a lot of null box issue when I run text2image: the same way as > described here:

Re: [tesseract-ocr] quality of recognition of customer invoices

2023-09-22 Thread Des Bw
", no dictionary etc.etc.etc > So, i more like a CPU usage of 99,99% and not superspeed. > > Can somebody help me ? > > On Friday, 22 September 2023 at 13:25:21 UTC+2, desal...@gmail.com wrote: > >> Apparently, version 4 doesn't support white listing. >> https://groups.goo

Re: [tesseract-ocr] quality of recognition of customer invoices

2023-09-22 Thread Des Bw
The difference between zero and O is deeply problematic, for the human eye. Some fonts make it even harder. You can try the method used here: https://pyimagesearch.com/2021/09/06/whitelisting-and-blacklisting-characters-with-tesseract-and-python/ if that helps. On Friday, September 22, 2023

Re: [tesseract-ocr] quality of recognition of customer invoices

2023-09-22 Thread Des Bw
Apparently, version 4 doesn't support white listing. https://groups.google.com/g/tesseract-ocr/c/IBbQIQpdSpE That is not good. On Friday, September 22, 2023 at 2:23:39 PM UTC+3 Des Bw wrote: > The difference between zero and O is deeply problematic, for the human > eye. Some fonts make i

[tesseract-ocr] Does the checkpoint_name contain the number of iterations

2023-09-20 Thread Des Bw
I couldn't understand what the numbers in the checkpoint names are. I looked at this one, but it is not clear to me: https://github.com/tesseract-ocr/tesseract/blob/3a7f5e4de459f4c64f36e08b18ce1b66b1fbc876/src/lstm/lstmtrainer.cpp#L410

[tesseract-ocr] Null box at index 0

2023-09-18 Thread Des Bw
I am having a lot of null box issues when I run text2image, the same way as described here: https://github.com/tesseract-ocr/tesseract/issues/2654 Since the issue seems to stem from a bug in text2image, I cannot wait for somebody to fix it. As a temporary solution, I have been deleting the

Re: [tesseract-ocr] Does the checkpoint_name contain the number of iterations

2023-09-20 Thread Des Bw
Thank you so much dear Shree. On Wednesday, September 20, 2023 at 4:57:52 PM UTC+3 shree wrote: > See > https://tesseract-ocr.github.io/tessdoc/tess4/TrainingTesseract-4.00.html#iterations-and-checkpoints > > On Wed, Sep 20, 2023, 2:53 AM Des Bw wrote: > >> I cou

[tesseract-ocr] Re: Can't encode transcription

2023-09-26 Thread Des Bw
Are you planning to fine tune for a specific font, or want to improve the overall accuracy of the best model? On Tuesday, September 26, 2023 at 6:35:38 PM UTC+3 Des Bw wrote: > I am also training for Amharic. > I am pretty sure you are using Windows OS. I had exactly the same p

[tesseract-ocr] Re: Can't encode transcription

2023-09-26 Thread Des Bw
I am also training for Amharic. I am pretty sure you are using Windows OS. I had exactly the same problem with it. I think it is related to Unicode. But, I was not able to solve the issue. I now installed Ubuntu on the side; and everything works fine. On Tuesday, September 26, 2023 at

[tesseract-ocr] Cutting the top layer is deteriorating the original training

2023-09-20 Thread Des Bw
The default traineddata for Amharic is pretty accurate except it misses a handful of characters. I have been emulating what Shree did to add the Norwegian Æ to the dataset. It actually worked like a charm. The problem is: I cannot get anywhere near the accuracy of the original best model. -

Re: [tesseract-ocr] quality of recognition of customer invoices

2023-09-22 Thread Des Bw
perspeed. >>> >>> Can somebody help me ? >>> >>> On Friday, 22 September 2023 at 13:25:21 UTC+2, >>> desal...@gmail.com wrote: >>> >>>> Apparently, version 4 doesn't support white listing. >>>> https://groups.googl

Re: [tesseract-ocr] quality of recognition of customer invoices

2023-09-22 Thread Des Bw
Shree is one of the most experienced, and definitely the most helpful, members of this group. I have also seen Zdenko answering some questions. You might have good luck with either of them. On Friday, September 22, 2023 at 4:07:12 PM UTC+3 Des Bw wrote: > If you have income source, you mi

[tesseract-ocr] BCER = 0.01; the actual result is not satisfactory

2023-10-06 Thread Des Bw
- I have been training on a large amount of data: about 390,000 lines of text for each font, for 15 fonts. I ran around 1.2 million iterations. The progress was encouraging to some degree. But, at some point, the BCER started to drop fast and reached 0.01. The training stopped at

[tesseract-ocr] Re: Should box include surrounding space?

2023-10-17 Thread Des Bw
If the space is included in the training across the board, the model might not recognize the comma when it appears without space (as in numbers: 23,334). On Wednesday, October 18, 2023 at 5:29:13 AM UTC+3 Danny wrote: > For purposes of training, I'm wondering if the box for a character

[tesseract-ocr] Re: Watching the learning iteration is better method than watching the BCER

2023-10-18 Thread Des Bw
In other words, the BCER is an unreliable measure of accuracy. At least, that is my experience training from synthetic data. On Wednesday, October 18, 2023 at 10:10:00 AM UTC+3 Des Bw wrote: > I am just writing a little observation here for beginners like me. > ( would love to be cor

[tesseract-ocr] Watching the learning iteration is better method than watching the BCER

2023-10-18 Thread Des Bw
I am just writing a little observation here for beginners like me (I would love to be corrected if I am wrong). I am training by cutting the top layer of a best model, to improve the existing model. I have about 400,000 lines of text, and generated the box and image files using text2image.

[tesseract-ocr] Re: Should box include surrounding space?

2023-10-18 Thread Des Bw
But, if your options are only to manually edit the boxes, I really have no knowledge of it. I have never tried that route. On Wednesday, October 18, 2023 at 3:43:51 PM UTC+3 Des Bw wrote: > You need a large data. That is all. > If you can collect a lot of text lines that contain all

[tesseract-ocr] Re: Should box include surrounding space?

2023-10-18 Thread Des Bw
You need a large dataset. That is all. If you can collect a lot of text lines that contain all those types of commas, and produce the training material using text2image (synthetic data) for each font, I am pretty sure Tesseract will learn all of them with no problem. On Wednesday, October 18,

Re: [tesseract-ocr] Re: Armenian.traineddata hye language tesseract

2023-10-20 Thread Des Bw
with various set ups and see the outcomes. On Friday, October 20, 2023 at 3:43:04 PM UTC+3 Des Bw wrote: > >- Fine tune. Starting with an existing trained language, train on your >specific additional data. This may work for problems that are close to the >existing t

Re: [tesseract-ocr] Re: Armenian.traineddata hye language tesseract

2023-10-20 Thread Des Bw
I find a documentation about this process somewhere ? > I am a tesseract user not (yet) a tesseract specialist. > > On Sun, 15 Oct 2023 at 08:39, Des Bw wrote: > >> Check the conversation in this forum where Shree trained the Norwegian >> data to include the missing letter

Re: [tesseract-ocr] accuracy problem after trained in fine-tune

2023-10-19 Thread Des Bw
Hi Ali, How is your training going? Do you get good results with training from scratch? On Friday, September 15, 2023 at 6:42:26 PM UTC+3 tesseract-ocr wrote: > yes, two months ago when I started to learn OCR I saw that. it was very > helpful at the beginning. > On Friday, 15

[tesseract-ocr] Nearly 99% accuracy

2023-10-19 Thread Des Bw
I am getting nearly 99% accuracy by training from the top layer of the network. I am training using synthetic data, and the evaluation is done on the same type of data. But the result is not extending to actually scanned documents. On the scanned documents, I am getting lower accuracy,

[tesseract-ocr] Lessons, hacks, best practices, lessons, recommendations

2023-10-21 Thread Des Bw
There is no exhaustive user manual for training tesseract. We all start in the darkness; and accumulate bits of information in different places to learn the ins and outs of tesseract. It would be great if we can collectively write a better manual. Up until then, we can drop /collect our

[tesseract-ocr] Lessons, best practices, recommendations, strategies, hacks

2023-10-21 Thread Des Bw
There is no exhaustive user manual for training tesseract. We all start in the darkness; and accumulate bits of information in different places to learn the ins and outs of tesseract. It would be great if we can collectively write a better manual. Up until then, we can drop /collect our

Re: [tesseract-ocr] accuracy problem after trained in fine-tune

2023-10-20 Thread Des Bw
Yah, that is what I am getting as well. I was able to add the missing letter. But, the overall accuracy became lower than the default model. On Saturday, October 21, 2023 at 3:22:44 AM UTC+3 mdalihu...@gmail.com wrote: > not good result. that's way i stop to training now. default traineddata

Re: [tesseract-ocr] accuracy problem after trained in fine-tune

2023-10-20 Thread Des Bw
How many lines of text and iterations did you use? On Saturday, October 21, 2023 at 8:36:38 AM UTC+3 Des Bw wrote: > Yah, that is what I am getting as well. I was able to add the missing > letter. But, the overall accuracy became lower than the default model. > > On Saturday, Octo

[tesseract-ocr] Re: Render Ground Truth from Scratch for Training

2023-10-21 Thread Des Bw
Hi Danny, Can you share your program for the community please? This is open source software; and many people are struggling to get things done. Sharing some experience and pieces of code could help a lot of people. On Saturday, October 21, 2023 at 3:30:06 AM UTC+3 Danny wrote: > The docs are

[tesseract-ocr] Re: How to create training data in teseract5.3.0 use tesstrain.sh way?

2023-10-22 Thread Des Bw
The shell script still works. But, if you are specifically looking for a python script, there are a number of python scripts posted in this forum. I personally have been using the script posted by Ali here: https://groups.google.com/g/tesseract-ocr/c/-G7TZEnVHgE On Sunday, October 22, 2023 at

[tesseract-ocr] Re: Error: traineddata file must contain at least (a unicharset fileand inttemp) OR an lstm file.

2023-10-22 Thread Des Bw
Are you trying to train from scratch? On Sunday, October 22, 2023 at 8:27:26 PM UTC+3 bkpalm...@gmail.com wrote: > > I have both of these files. I don't understand. They are both prefixed > with .eng in my tessdata directory. I am so close... >

Re: [tesseract-ocr] Lessons, best practices, recommendations, strategies, hacks

2023-10-22 Thread Des Bw
k myself, but my guess is that there are things >> in the doc that could be filled without much effort by you or others. >> If this is true, once the doc is filled out, the contents of the google >> doc could be submitted as a PR to the tesstrain repo. >> Again, just a su

Re: [tesseract-ocr] Lessons, best practices, recommendations, strategies, hacks

2023-10-21 Thread Des Bw
epo. > Again, just a suggestion that I hope would be helpful to all. > > Thanks, > Keith > > On Sat, Oct 21, 2023 at 8:28 AM Des Bw wrote: > >> There is no exhaustive user manual for training tesseract. We all start >> in the darkness; and accumulate

Re: [tesseract-ocr] accuracy problem after trained in fine-tune

2023-10-22 Thread Des Bw
here it is: https://github.com/tesseract-ocr/tessdoc/blob/main/Data-Files-in-tessdata_best.md On Sunday, October 22, 2023 at 12:45:40 PM UTC+3 Des Bw wrote: > This is the code I used to train from a layer: > *make training MODEL_NAME=amh START_MODEL=amh APPEND_INDEX=5 > NET_SPEC

Re: [tesseract-ocr] accuracy problem after trained in fine-tune

2023-10-22 Thread Des Bw
0 to1. > On Saturday, 21 October, 2023 at 11:37:13 am UTC+6 desal...@gmail.com > wrote: > >> How many lines of text and iterations did you use? >> >> On Saturday, October 21, 2023 at 8:36:38 AM UTC+3 Des Bw wrote: >> >>> Yah, that is what I am getting

Re: [tesseract-ocr] accuracy problem after trained in fine-tune

2023-10-22 Thread Des Bw
like 10 lines >>>> of text and iteration is only 5000 to1. >>>> On Saturday, 21 October, 2023 at 11:37:13 am UTC+6 desal...@gmail.com >>>> wrote: >>>> >>>>> How many lines of text and iterations did you use? >>>>>

Re: [tesseract-ocr] Lessons, best practices, recommendations, strategies, hacks

2023-10-21 Thread Des Bw
Another useful parameter to turn ON would have been *perspective*. But, that one is not working for me. On Saturday, October 21, 2023 at 10:45:31 PM UTC+3 Des Bw wrote: > I have been experimenting with the text2image script: > Here are some of my observations

Re: [tesseract-ocr] Lessons, best practices, recommendations, strategies, hacks

2023-10-21 Thread Des Bw
. But, the final model will be less fit to recognize dense texts. On Saturday, October 21, 2023 at 8:58:01 PM UTC+3 Des Bw wrote: > That is a good start, dear Keith. Very good idea. We can contribute texts > and ideas; and develop it into a booklet or "getting started guide"--maki

Re: Re: [tesseract-ocr] Lessons, best practices, recommendations, strategies, hacks

2023-10-24 Thread Des Bw
with a suggested format for a tutorial which would be very helpful to me >> and I think to others. It is editable by anyone with the link. >> >> I'm glad to put in any work myself, but my guess is that there are things >> in the doc that could be filled without much effort by

[tesseract-ocr] Re: Training from Scratch

2023-11-23 Thread Des Bw
If the original model lacks the ∠ symbol, fine tuning is not going to add it for you. We have all gone through that process. To introduce a new character, removing the top layer and training from there is the most effective approach. On Thursday, November 23, 2023 at 12:15:56 PM UTC+3

[tesseract-ocr] Re: Training Metrics

2023-11-23 Thread Des Bw
I think they are abbreviations: best char error = BCER; character error = CER. There are no signs to tell if the model is overfit. I know no diagnostics for that. For fine-tuning, running iterations higher than 400 is always problematic because it destroys the base model. - So, the common

[tesseract-ocr] Re: Training Metrics

2023-11-23 Thread Des Bw
BCER (best character error rate) is automatically picked by Tesseract from the list of all character error rates (CER). On Thursday, November 23, 2023 at 12:34:40 PM UTC+3 Des Bw wrote: > I think they are abbreviations: > best char error =BCER > character error = CER > > There is n
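
Since several posts in this archive note that the BCER reported during training does not carry over to real scans, one option is to measure a character error rate yourself on ground-truth lines from actual scanned pages. The sketch below uses the common textbook definition (Levenshtein distance divided by reference length); it is not claimed to match how Tesseract computes its internal BCER, and the function names are chosen for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate of an OCR hypothesis against a reference line."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Here `reference` would be the ground-truth transcription of a scanned line and `hypothesis` the text Tesseract produced for the same line; averaging over many lines gives a figure that reflects the documents you actually care about.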

[tesseract-ocr] Re: Training from Scratch

2023-11-23 Thread Des Bw
Download the best model and try it. If it recognizes the symbol, that is great. You can also look at the unicharset of the best model.
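
A minimal Python sketch of the unicharset check suggested above. It assumes the unicharset has already been extracted from the traineddata into a plain text file (for example with combine_tessdata), and that the file follows the usual layout: a count on the first line, then one entry per line beginning with the character itself. The function names are made up for this example.

```python
def load_unichars(path):
    """Return the set of entries listed in a unicharset text file."""
    with open(path, encoding="utf-8") as fh:
        next(fh, None)  # skip the first line (number of entries)
        return {
            line.rstrip("\n").split(" ", 1)[0]
            for line in fh
            if line.strip()
        }


def missing_chars(path, wanted):
    """Return the characters from `wanted` that the unicharset lacks."""
    known = load_unichars(path)
    return [ch for ch in wanted if ch not in known]
```

Entries can be longer than one code point in some scripts, so a character that is reported missing may still occur inside a combined entry; treat the result as a first check rather than a definitive answer.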

[tesseract-ocr] Re: Training from Scratch

2023-11-23 Thread Des Bw
If you are planning to train, you need to make sure that your images contain all those variations: in thickness, angle etc. I don't know if text2image can do that for you. You might need to do it manually; or use some other tool. On Thursday, November 23, 2023 at 12:39:21 PM UTC+3 Des Bw

Re: [tesseract-ocr] I am unable to train a new font to tesseract, I am getting a deserialize failed error

2023-11-22 Thread Des Bw
Your issue is probably related to this one: https://github.com/tesseract-ocr/tesseract/issues/792 Are you on Windows or Ubuntu? You might try upgrading tesseract to version 5. I am not well versed in tesseract, so my knowledge is very limited. On Thursday, November 23, 2023 at

Re: [tesseract-ocr] I am unable to train a new font to tesseract, I am getting a deserialize failed error

2023-11-22 Thread Des Bw
Make sure that the tif files are not corrupted and that the box files are not zero size. Des On 23 Nov 2023 at 9:26:39 AM, Adepu Sai Rahul wrote: > > chinnu@SaiRahul2507:~/tesseract_tutorial/tesstrain$ > TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=Y145 > START_MODEL=eng
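A quick way to do the zero-size part of that check is sketched below. It assumes all .tif and .box files sit in one directory and that each image name.tif has a matching name.box; the function name is hypothetical. It does not detect a corrupted but non-empty tif; that would need an image library to actually open the file.

```python
from pathlib import Path


def find_suspect_files(ground_truth_dir):
    """Return (empty_files, tifs_without_box) for a training directory."""
    root = Path(ground_truth_dir)
    # Any .tif or .box file of size zero is certainly unusable.
    empty = sorted(
        p.name
        for p in root.iterdir()
        if p.is_file() and p.suffix in {".tif", ".box"} and p.stat().st_size == 0
    )
    # A .tif without a matching .box file cannot be used for training either.
    missing_box = sorted(
        p.name for p in root.glob("*.tif") if not p.with_suffix(".box").exists()
    )
    return empty, missing_box
```

Run it on your ground-truth directory before starting make training and fix or delete whatever it reports.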

Re: [tesseract-ocr] Re: Training from Scratch

2023-11-24 Thread Des Bw
now because the default model missed one important character. On Thursday, November 23, 2023 at 8:59:01 PM UTC+3 zdenop wrote: > > št 23. 11. 2023 o 10:28 Des Bw napísal(a): > >> If the original model lacks the ∠ symbol, fine tuning is not going to >> add it

Re: [tesseract-ocr] Any success story?

2023-11-15 Thread Des Bw
racy. Any suggestions? I don't know if > tesseract would ever be able to do this alone. > > I also tried training tesseract from scratch using synthetic data but have > not yet achieved the same accuracy. I think the problem is that the > synthetic data doesn't simulate real data closely

Re: [tesseract-ocr] Any success story?

2023-11-15 Thread Des Bw
UTC+3 Merlijn Wajer wrote: > Hi, > > On 14/11/2023 06:55, Des Bw wrote: > > It looks like every one is having issues with tesseract. I am not able > > to find any one who has a great success with this software. > > It would be really encouraging to hear any success

Re: [tesseract-ocr] Any success story?

2023-11-15 Thread Des Bw
don't know if > tesseract would ever be able to do this alone. > > I also tried training tesseract from scratch using synthetic data but have > not yet achieved the same accuracy. I think the problem is that the > synthetic data doesn't simulate real data closely enough.

[tesseract-ocr] Dictionary?

2023-11-19 Thread Des Bw
Does Tesseract actually use the dictionary (wordlist) included in the model (traineddata file)? - I am not seeing any difference/impact from including a dictionary (word list) in the file. Has anybody experimented with a dictionary setup?

Re: [tesseract-ocr] Dictionary?

2023-11-19 Thread Des Bw
test for the LSTM engine. > > Zdenko > > > ne 19. 11. 2023 o 18:37 Des Bw napísal(a): > >> Does Tesseract actually use the dictionary (wordlist) included into the >> model (traineddata file)? >> >> - I am not getting any difference/impact by includi

Re: [tesseract-ocr] Extender letter recognized as underline for arabic text

2023-11-20 Thread Des Bw
On 20 Nov 2023 at 4:39:29 PM, Sifdin Nahhas wrote: > Can you try to remove it from the list of punctuations? > > To do that, you need to extract the components of the traineddata file, > edit the ara.punc file, and then recombine them. > > To extract the components: combine_tessdata -d

[tesseract-ocr] Re: How to start from scratch (new language) in Tesseract 5

2023-11-16 Thread Des Bw
Hi Jephthah, *Creating a starter traineddata: * You need: 1. *unicharset*: you can prepare it by hand. You can take the English sample and modify it. 2. *script*: if the language is written in Latin, you can download the latin script from the tesseract GitHub repo (

[tesseract-ocr] Re: How to start from scratch (new language) in Tesseract 5

2023-11-16 Thread Des Bw
which tesseract will use for the training. On Thursday, November 16, 2023 at 9:10:52 PM UTC+3 Des Bw wrote: > Hi Jephthah, > > > *Creating a starter traineddata: * > > > > You need: > > 1. *unicharset*: you can prepare it by hand. You can take the English

[tesseract-ocr] Re: Any success story?

2023-11-17 Thread Des Bw
Dear Tom, thank you for listing out all the sources. Indeed, I didn't look hard. I was mostly reading this forum; and sure, I am familiar with Shree's (Nick White?) works. >(like, a model that can detect with higher accuracy: 98% or more ?) >An accuracy figure without context is meaningless.

[tesseract-ocr] Re: How to start from scratch (new language) in Tesseract 5

2023-11-17 Thread Des Bw
ox files >> which tesseract will use for the training. >> >> On Thursday, November 16, 2023 at 9:10:52 PM UTC+3 Des Bw wrote: >> >>> Hi Jephthah, >>> >>> >>> *Creating a starter traineddata: * >>> >>> >>> >
