Re: [tesseract-ocr] Beginner question : could not initialize tesseract, missing eng.traineddata file in tessdata

2024-04-25 Thread Zdenko Podobny
If you used the tesstrain you trained the lstm engine. Why do you then ask tesseract to use a legacy engine? Do you understand what you are doing? Zdenko št 25. 4. 2024 o 11:35 Surya VaraPrasad Alla napísal(a): > eng_pcb.traineddata is a traineddata starting with eng.traineddata > > i did

Re: [tesseract-ocr] Beginner question : could not initialize tesseract, missing eng.traineddata file in tessdata

2024-04-22 Thread Zdenko Podobny
No, you are not using best float tessdata files from: https://github.com/tesseract-ocr/tessdata_best/blob/main/eng.traineddata There is nothing like eng_pcb.traineddata. (read your error message) Zdenko po 22. 4. 2024 o 17:40 Surya VaraPrasad Alla napísal(a): > Hello, > > I have the similar

Re: [tesseract-ocr] tesseract misleading in 8 and 6

2024-04-18 Thread Zdenko Podobny
Unfortunately, your post is very vague. Unless you provide a detailed description of what you are doing (step-by-step so we can replicate it), nobody can help you. Zdenko st 17. 4. 2024 o 12:14 Jayrajsinh Zala napísal(a): > I train tesseract ocr using MATLAB and use specific train data file

Re: [tesseract-ocr] Getting Error: No such file or directory: 'data/foo/all-lstmf'

2024-03-27 Thread Zdenko Podobny
You can try custom images - see the example ocrd-testset.zip And follow the example from https://github.com/tesseract-ocr/tesstrain/blob/main/README.md : unzip ocrd-testset.zip -d data/ocrd-ground-truth make training
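Before running make training on custom images, it can help to check that every line image has a transcription next to it. The following is only a sketch: it assumes the tesstrain convention of .tif/.png line images paired with .gt.txt files in one ground-truth directory, and the helper names (missing_transcriptions, demo) are made up for illustration.

```python
import tempfile
from pathlib import Path

# Image types assumed for line images in the ground-truth directory.
IMAGE_SUFFIXES = {".tif", ".png"}


def missing_transcriptions(ground_truth_dir):
    """Return sorted names of images that have no matching .gt.txt file."""
    root = Path(ground_truth_dir)
    missing = []
    for image in sorted(root.iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip transcriptions and any other files
        transcript = image.with_name(image.stem + ".gt.txt")
        if not transcript.exists():
            missing.append(image.name)
    return missing


def demo():
    """Build a tiny throwaway directory and report unpaired images."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "line1.tif").write_bytes(b"")
        (root / "line1.gt.txt").write_text("hello", encoding="utf-8")
        (root / "line2.png").write_bytes(b"")  # no transcription on purpose
        return missing_transcriptions(root)


if __name__ == "__main__":
    print(demo())
```

With real data you would point missing_transcriptions at something like data/ocrd-ground-truth and fix any file it reports before starting the training run.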

Re: [tesseract-ocr] fine tuning on images

2024-03-27 Thread Zdenko Podobny
You can easily test your hypothesis by modifying Makefile[1] lines from tesseract "$<" $* --psm $(PSM) lstm.train to tesseract "$<" $* --psm $(PSM) -l $(START_MODEL) lstm.train [1] https://github.com/tesseract-ocr/tesstrain/blob/19f79e2d38dfeada41a96c8d87426c85a7eaa454/Makefile#L242-L255
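To make the difference between the two Makefile lines concrete, here is a small sketch that builds both command variants as argument lists. It only mirrors the two lines quoted above; the function name lstmf_command and the example file names are hypothetical, and nothing is executed.

```python
def lstmf_command(image, output_base, psm, start_model=None):
    """Build the tesseract call that produces an .lstmf training file.

    Without start_model this mirrors the original Makefile line; with it,
    "-l <start_model>" is added as in the suggested modification.
    """
    cmd = ["tesseract", image, output_base, "--psm", str(psm)]
    if start_model:
        cmd += ["-l", start_model]
    cmd.append("lstm.train")  # config file name goes last, as in the Makefile
    return cmd


if __name__ == "__main__":
    print(lstmf_command("line1.tif", "line1", 13))
    print(lstmf_command("line1.tif", "line1", 13, start_model="eng"))
```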

Re: [tesseract-ocr] Lack of accuracy on reading numbers

2024-03-27 Thread Zdenko Podobny
Always test the command line if there is an issue with the wrapper. tesseract -v tesseract 5.3.4-44-g2b07 leptonica-1.84.0 (Dec 31 2023, 23:36:37) [MSC v.1929 LIB Release x64] libgif 5.1.2 : libjpeg 6b (libjpeg-turbo 2.1.90) : libpng 1.6.40 : libtiff 4.6.0 : zlib 1.2.13.zlib-ng : libwebp 1.3.2
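One way to follow this advice from a Python program is to bypass the wrapper and call the executable yourself, then compare the results. This is a rough sketch only: the helper names are invented for illustration, and it assumes the tesseract executable is on PATH (it returns None for the version otherwise).

```python
import shutil
import subprocess


def direct_command(image, psm=None, lang=None):
    """Build the argument list for calling the tesseract executable directly."""
    cmd = ["tesseract", image, "-"]  # "-" sends the recognized text to stdout
    if psm is not None:
        cmd += ["--psm", str(psm)]
    if lang:
        cmd += ["-l", lang]
    return cmd


def first_line(text):
    """Return the first non-empty line of a block of text, or ''."""
    lines = text.strip().splitlines()
    return lines[0] if lines else ""


def installed_version():
    """Return the first line of `tesseract --version`, or None if not found."""
    if shutil.which("tesseract") is None:
        return None
    result = subprocess.run(
        ["tesseract", "--version"], capture_output=True, text=True
    )
    # Some builds print the version on stderr, so check both streams.
    return first_line(result.stdout or result.stderr)


if __name__ == "__main__":
    print(installed_version())
    print(direct_command("scan.png", psm=6))
```

If the executable gives the expected text and the wrapper does not, the problem is in how the wrapper is called, not in Tesseract itself.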

Re: [tesseract-ocr] Reading large gray images with only numbers yields incorrect results

2024-03-26 Thread Zdenko Podobny
Yes, we have suggestions for you to improve the accuracy of the results: they are already in the documentation. Just read it. Zdenko Tue 26. 3. 2024 at 13:41 inKi Wang wrote: > Hi everyone, I wish you all a good day. > > I'm currently encountering an issue with image_to_string producing >

Re: [tesseract-ocr] Does training new images increase the size of the traindata file?

2024-03-26 Thread Zdenko Podobny
Unless you provide information about what you do, and the possibility to replicate your process (providing input data) we do not know what is wrong with it. Did you check the example for official training[1]? In my case I see this: Output has size 7485144 (`ls -l data/ocrd.traineddata`) while

Re: [tesseract-ocr] Leptonica directory

2024-03-13 Thread Zdenko Podobny
It seems like you are not following the official documented way for compiling leptonica and tesseract. Follow it. Then we can help you. Zdenko st 13. 3. 2024 o 6:43 Ravil R napísal(a): > Windows, msvc 2022, win32, I've got some questions regarding compilation > 1) How to specify the

Re: [tesseract-ocr] user patterns with tesserocr python API

2024-03-12 Thread Zdenko Podobny
. 2024 o 17:32 Zdenko Podobny napísal(a): > Maybe I am wrong, but it looks to me like you are expecting from > user-patterns something it never promises to provide. > What we know/experienced: > >- user-patterns extends the Tesseract legacy engine dictionary. >- putt

Re: [tesseract-ocr] user patterns with tesserocr python API

2024-03-10 Thread Zdenko Podobny
no effect on the > results independently from the entries of the *.patterns file: > > api.SetVariable('user_patterns_file', > '/home/roman/Dev_d/playground/user_patterns/deu.patterns') > > Does anyone has (successfully) used user patterns with the tesserocr > Python API of tesser

Re: [tesseract-ocr] Re: Post OCR Verification and Editing

2024-03-08 Thread Zdenko Podobny
Hello, I am not sure if OCRmyPDF (https://ocrmypdf.readthedocs.io/en/latest/) allows redaction. If you would like to implement the text layer yourself with a custom font, have a look at PyMuPDF: - https://github.com/pymupdf/PyMuPDF/discussions/775 (Adding text layer to a scanned PDF) -

Re: [tesseract-ocr] user patterns with tesserocr python API

2024-03-02 Thread Zdenko Podobny
Can you please elaborate on: Nevertheless, user patterns is not working in the way described above. Zdenko so 2. 3. 2024 o 10:45 Roman Seidel napísal(a): > Yes, sure, the input file is a snippet with a capital letter followed by 9 > digits. The correct user pattern, corresponding to [1]

Re: [tesseract-ocr] How to correctly define CMakeLists.txt for Tesseract?

2024-02-20 Thread Zdenko Podobny
Any reason why to use an external 3rd party app that is not available on all platforms instead of cmake native function which is available everywhere cmake is? Zdenko ut 20. 2. 2024 o 18:02 Tom Morris napísal(a): > On Monday, February 19, 2024 at 4:49:07 AM UTC-5 raphael.s...@gmail.com >

Re: [tesseract-ocr] How to correctly define CMakeLists.txt for Tesseract?

2024-02-17 Thread Zdenko Podobny
First of all: you should use tools you are familiar with. Your CMake configuration (CMakeLists.txt) does not look that way (you would use CMake to check required libraries, not PkgConfig, you would not hardcode curl for linking, etc.). Next, you should provide all the details to replicate the

Re: [tesseract-ocr] Re: image_to_string OSD hell

2024-02-13 Thread Zdenko Podobny
Works like a charm: just read and follow documentation carefully: >tesseract e_I_read_documetation_carefully.png - --psm 10 D >tesseract d_I_read_documetation_carefully.png - --psm 10 E >tesseract d-I_read_documetation_carefully.png - --psm 10 D- Zdenko st 14. 2. 2024 o 2:14 dev 313153

Re: [tesseract-ocr] Trouble with Apparently Simple Source Image

2024-02-12 Thread Zdenko Podobny
tesseract I_read_docs_carefully_instead_of_a_lot_of_writing.png - --psm 6 $0.081 Zdenko po 12. 2. 2024 o 18:40 Rob napísal(a): > Hello, > > I've run into some trouble using Tesseract OCR in a python program doing > some screen scraping. I can't quite wrap my head around why this one value >

Re: [tesseract-ocr] Make russian_with_accent traineddata file

2024-02-06 Thread Zdenko Podobny
You are referring to an old issue... Either you provide steps to replicate your problem (including the input image) or you have to solve it by yourself. Zdenko Mon 5. 2. 2024 at 9:53 Romain B. (Le Belge) wrote: > Hi, > > > I saw that tesseract make

Re: [tesseract-ocr] Re: I need help to develop image to text extraction

2024-02-06 Thread Zdenko Podobny
Did you read the tesseract documentation? Do you understand it? Zdenko ut 6. 2. 2024 o 12:38 Santhiya C napísal(a): > How do i fix this issue using training tesseract ocr custom data > > On Tuesday 6 February 2024 at 12:11:03 UTC+5:30 Santhiya C wrote: > >> can you please tell me model and

Re: [tesseract-ocr] OCR of free hand photo of book

2024-01-31 Thread Zdenko Podobny
Tesseract is an OCR engine and the user is responsible for preprocessing - see the documentation. IMO there is already an app (using tesseract) for what you are trying to do: Text Fairy [1] [1] https://play.google.com/store/apps/details?id=com.renard.ocr&hl=en Zdenko Wed 31. 1. 2024 at 2:00 Borneq wrote:

Re: [tesseract-ocr] tesseract is reading passport mrz text from image incorrectly, its identifying <<<<<<<< as kkkk or cccc

2024-01-27 Thread Zdenko Podobny
Well in this case it works without image processing ;-) Anyway mrz is not "official" Tesseract training and there are people who play with it, so it will take some time to search and dig their findings/experience/expertise Zdenko so 27. 1. 2024 o 12:02 sara waheed napísal(a): > if I

Re: [tesseract-ocr] tesseract is reading passport mrz text from image incorrectly, its identifying <<<<<<<< as kkkk or cccc

2024-01-27 Thread Zdenko Podobny
What about reading docs and a little bit googling? tesseract two-page-passport-mrz-detected.jpeg - --psm 6 -l mrz IDAUT1999<6<<< 7109094F1112315AUT<<<6 MUSTERFRAU< napísal(a): > I am trying to read the passport mrz string from the image i am using > Tesseract and OpenCV

Re: [tesseract-ocr] Unable to build on macOS Mojave (10.14)

2024-01-23 Thread Zdenko Podobny
You can install tesseract without building & installing the training tools. Anyway, requiring tesseract as a dependency of ffmpeg makes no sense to me (and it is not listed at https://trac.ffmpeg.org/wiki/CompilationGuide/macOS). So something in homebrew should be fixed/set up correctly. Zdenko ut

Re: [tesseract-ocr] Miss lots of words in the detection

2024-01-22 Thread Zdenko Podobny
ll-lit portion. > However, when I cropped the image to retain either the upper or left > portions, Tesseract exhibited improved performance, successfully detecting > numerous words in those respective areas. > > Best, > Haitao > > On Sun, Jan 21, 2024 at 3:02 AM Zde

Re: [tesseract-ocr] Miss lots of words in the detection

2024-01-21 Thread Zdenko Podobny
Did you read the documentation or did you just set your expectations? Zdenko ne 21. 1. 2024 o 12:00 L ht napísal(a): > I am new to use tesseract. I found tesseract does not work as expected. I > attach one example. > > tesseract 5.3.2 > tesseract 272525030292764523137280353496213864766.png -

Re: [tesseract-ocr] Re: Unable to get Orientation with node-tesseract, Warning, detects only orientation with -l eng Error, OSD requires a model for the legacy engine

2024-01-13 Thread Zdenko Podobny
You do not need to rename traineddata. You can move them to tessdata subdirectory e.g. tessdata/fast, tessdata/best and then use it at "-l best/eng" or "-l fast/eng" Zdenko so 13. 1. 2024 o 3:38 Oliver Saintilien napísal(a): > Oh right, for those facing a similar issue, what I did was > 1.
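The subdirectory layout described above can be sketched as follows. This is just an illustration of how a "-l best/eng" style value maps to a file on disk; the helper names and the /usr/share/tessdata location are assumptions, not something Tesseract provides.

```python
from pathlib import PurePosixPath


def lang_spec(variant, lang):
    """Value for the -l option when models live in a tessdata subdirectory."""
    return f"{variant}/{lang}"


def expected_file(tessdata_dir, variant, lang):
    """Path where the traineddata file is expected for that -l value."""
    return str(PurePosixPath(tessdata_dir) / variant / f"{lang}.traineddata")


if __name__ == "__main__":
    for variant in ("fast", "best"):
        print("-l", lang_spec(variant, "eng"), "->",
              expected_file("/usr/share/tessdata", variant, "eng"))
```

Both variants keep the file name eng.traineddata, so nothing has to be renamed; only the subdirectory in the -l value changes.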

Re: [tesseract-ocr] Unable to build on macOS Mojave (10.14)

2024-01-12 Thread Zdenko Podobny
ffmpeg needs the tesseract training tools as a dependency? I guess something is misconfigured. On most unix-like systems the training tools are packaged separately and not installed by default. Can you avoid the 'make training' step? I also wonder why the tesseract build process did not stop during configuration

Re: [tesseract-ocr] Re: Unable to get Orientation with node-tesseract, Warning, detects only orientation with -l eng Error, OSD requires a model for the legacy engine

2024-01-12 Thread Zdenko Podobny
*tesseract executable problem:* for TESSDATA_PREFIX you use a path with a space and you did not escape it properly. That is why you get an error about a non-existent file ("C:\Program/eng.traineddata"). Solutions: a) use a path without special characters like spaces b) learn how to properly
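The effect of an unescaped space can be shown with a few lines of Python. This is a sketch under assumptions: the install path below is only an example, and the helper names are hypothetical. It shows how a naive whitespace split truncates the path at "C:\Program", and how passing the value through an environment mapping keeps it intact without any quoting.

```python
import os


def naive_split(command_line):
    """Split on whitespace the way an unquoted command line gets split."""
    return command_line.split()


def tesseract_env(tessdata_dir, base_env=None):
    """Copy an environment mapping and set TESSDATA_PREFIX in the copy.

    The value is stored as one string, so spaces in the path need no
    escaping when the mapping is handed to subprocess.run(..., env=...).
    """
    env = dict(os.environ if base_env is None else base_env)
    env["TESSDATA_PREFIX"] = tessdata_dir
    return env


if __name__ == "__main__":
    path = r"C:\Program Files\Tesseract-OCR\tessdata"  # example path only
    print(naive_split(path)[0])                    # truncated at the space
    print(tesseract_env(path, {})["TESSDATA_PREFIX"])  # full path preserved
```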

Re: [tesseract-ocr] Re: Unable to get Orientation with node-tesseract, Warning, detects only orientation with -l eng Error, OSD requires a model for the legacy engine

2024-01-11 Thread Zdenko Podobny
Unfortunately you don't. Instead of showing irrelevant information, make sure tesseract (outside of wrapper) is providing expected results. You are claiming "I keep getting an error that I have to set the TESSDATA_PREFIX" but your only relevant screenshot (you made it hardly readable) shows

Re: [tesseract-ocr] Unable to build on macOS Mojave (10.14)

2024-01-11 Thread Zdenko Podobny
std::filesystem::path[1] is part of the C++17 standard and Tesseract has required this standard for a long time (4-5 years)[2]. So you suggest reverting this decision? [1] https://en.cppreference.com/w/cpp/filesystem/path [2]

Re: [tesseract-ocr] Using tesseract in node

2024-01-10 Thread Zdenko Podobny
Tesseract is not trained for handwritten text. Zdenko st 10. 1. 2024 o 7:02 Sandeep Shakya napísal(a): > import tesseract from "node-tesseract-ocr"; > import fs from "fs"; > > const img = fs.readFileSync("./src/extract_user_input/2.jpg"); > > const config = { > lang: "eng", > // oem: 1, >

Re: [tesseract-ocr] Unable to build on macOS Mojave (10.14)

2024-01-10 Thread Zdenko Podobny
... I am trying to upgrade tesseract from 5.2.0 to 5.3.3 on macOS 10.14.6 ... is unavailable: introduced in macOS 10.15 Upgrade? Zdenko st 10. 1. 2024 o 17:29 Benoît Mars napísal(a): > I am trying to upgrade tesseract from 5.2.0 to 5.3.3 on macOS 10.14.6 via > Homebrew (v 4.2.3). The build

Re: [tesseract-ocr] Errors With Downloading Tesseract v4.1.1

2024-01-08 Thread Zdenko Podobny
Please provide full log of whole process (starting from autogen.sh) Zdenko ut 9. 1. 2024 o 6:50 Evaan Ahmed napísal(a): > Hey y'all! > > On my local machine (a Mac), I'm trying to download the version of > Tesseract that is available on Google Colab. This is version 4.1.1. I > downloaded the

Re: [tesseract-ocr] Phantom characters

2024-01-01 Thread Zdenko Podobny
post: 1. Original image (without preprocessing) 2. + image used for OCR (preprocessed) 3. + output from tesseract executable (not tesseract wrappers) and used parameters/option Otherwise, nobody can reproduce the problem and therefore suggest a solution. Zdenko ne 31. 12. 2023 o

Re: [tesseract-ocr] Failed to load list of training filenames from data/foo/list.train

2024-01-01 Thread Zdenko Podobny
Follow https://github.com/tesseract-ocr/tesstrain/blob/main/README.md Tesseract OCR 3.05.02 was released 6 years ago... Zdenko so 30. 12. 2023 o 18:24 Omar Samir napísal(a): > I was trying to train Tesseract-OCR on the ocrd-testset.zip in the README, > and I get this error above in the

Re: [tesseract-ocr] tessract usage

2024-01-01 Thread Zdenko Podobny
Did you check license? https://github.com/tesseract-ocr/tesseract/blob/main/LICENSE Zdenko st 27. 12. 2023 o 17:56 Ajay Bhosle napísal(a): > Can i use tesseract to extract text from pdf for commercial use? > > -- > You received this message because you are subscribed to the Google Groups >

Re: [tesseract-ocr] inaccuracy in plane text

2023-12-25 Thread Zdenko Podobny
I put it to documentation because I had the same problem as you (to find it) :-) Zdenko po 25. 12. 2023 o 4:40 Ger Hobbelt napísal(a): > > > On Sat, 23 Dec 2023, 19:16 Zdenko Podobny, wrote: > >> tesseract expects black text (lettering) on a white background: that's &

Re: [tesseract-ocr] inaccuracy in plane text

2023-12-23 Thread Zdenko Podobny
tesseract expects black text (lettering) on a white background: that's what it has been trained on and that's what will work best. Hence: try to convert anything to look like that before feeding it to Tesseract. This is not needed (in all cases ;-) ): tesseract inverts an image by itself
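For cases where you do want to normalize polarity yourself, the idea can be sketched in plain Python on a grayscale image given as rows of 0-255 values. The mean-brightness test and the threshold of 128 are assumptions chosen for illustration, not Tesseract's own logic; with real images you would use an imaging library instead of nested lists.

```python
def mean_intensity(pixels):
    """Average gray value of an image given as a list of rows (0-255)."""
    total = sum(sum(row) for row in pixels)
    count = sum(len(row) for row in pixels)
    return total / count


def ensure_dark_on_light(pixels, threshold=128):
    """Invert the image if it looks like light text on a dark background.

    Heuristic (an assumption): a mostly dark image has a low mean, so its
    background is probably dark and the image should be inverted.
    """
    if mean_intensity(pixels) >= threshold:
        return [list(row) for row in pixels]  # already light background
    return [[255 - value for value in row] for row in pixels]


if __name__ == "__main__":
    white_on_black = [[0, 0, 255], [0, 0, 0]]
    print(ensure_dark_on_light(white_on_black))
```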

Re: [tesseract-ocr] Font Not Found Error

2023-12-20 Thread Zdenko Podobny
1. tesseract 4 is outdated. 2. tesstrain.sh is deprecated. Zdenko Wed 20. 12. 2023 at 11:18 Uvindu Bimsara wrote: > When i started training tesseract 4.0 using tesstrain.sh for sinhala > unicode font got this error. > === Starting training for language 'sin' [Wed Dec 20 09:44:58 AM UTC

Re: [tesseract-ocr] Numbers detection

2023-12-19 Thread Zdenko Podobny
Hello, for Tesseract you need to remove all non-text parts (graphic elements). IMO the outlined number would also be problematic. It would be better to post the original image so people can play with preprocessing... See e.g. this discussion

Re: [tesseract-ocr] getComponnentImages falling short of a few words/ characters

2023-12-17 Thread Zdenko Podobny
First of all, provide the original input image. Next, it would be nice to see code to replicate the problem. Zdenko ne 17. 12. 2023 o 8:04 'Muhammad Ali' via tesseract-ocr < tesseract-ocr@googlegroups.com> napísal(a): > Hi team, > > I had a few recurring issues regarding inaccuracy of

Re: [tesseract-ocr] Fasten Tesseract OCR

2023-12-14 Thread Zdenko Podobny
A more effective approach to addressing the issue is to create a test/example case. Advanced users can then evaluate and potentially offer solutions It would be helpful if you could provide details on how you obtain and process the input images, as well as the OCR execution method (API, wrapper,

Re: [tesseract-ocr] Fasten Tesseract OCR

2023-11-29 Thread Zdenko Podobny
Your request is too general e.g. reply could be "upgrade your hardware"... ;-) Unless you provide details about your testing environment + process of measuring speed and testing images, there is just one general advice: read the docs and issue tracker (including closed issues), there are several

Re: [tesseract-ocr] Tesseract on single digit detection

2023-11-27 Thread Zdenko Podobny
Crop images properly (without borders) and follow suggestions in docs: >tesseract pic2_cropped_postprocessed.png - --psm 10 5 >tesseract pic4_cropped_postprocessed.png - --psm 10 7 Zdenko po 27. 11. 2023 o 9:42 Fernando Benayas de los Santos < ferbenaya...@gmail.com> napísal(a): > Hi

Re: [tesseract-ocr] Failed loading language 'eng'

2023-11-25 Thread Zdenko Podobny
tesseract 3.x is unsupported. I am not a Java developer, but according to https://github.com/nguyenq/tess4j/releases tess4j-5.8.0 should support Tesseract 5.3.2, so I would start from that. If there is still a problem, have a look at their wiki (https://github.com/nguyenq/tess4j/wiki) and issue

Re: [tesseract-ocr] Failed loading language 'eng'

2023-11-25 Thread Zdenko Podobny
you used an old unsupported version of your tools (not sure if the problem is in the used/installed wrapper or Tesseract library...) - the cube engine was removed from Tesseract several years ago... Zdenko so 25. 11. 2023 o 15:31 'sanogo sy' via tesseract-ocr < tesseract-ocr@googlegroups.com>

Re: [tesseract-ocr] Failed loading language 'eng'

2023-11-25 Thread Zdenko Podobny
And the result is? Zdenko so 25. 11. 2023 o 13:07 'sanogo sy' via tesseract-ocr < tesseract-ocr@googlegroups.com> napísal(a): > I forgot to mentione that I use Centos 7. > I tried that command : tesseract img.jpg out > > As result I got a message like: > > Estimating resolution as 181 > Error

Re: [tesseract-ocr] Failed loading language 'eng'

2023-11-25 Thread Zdenko Podobny
Does the tesseract executable have the same problem? If yes, then check the content of /usr/share/tesseract-ocr/4/tessdata/ If not, follow the code of the tesseract executable. Zdenko Sat 25. 11. 2023 at 11:07 'sanogo sy' via tesseract-ocr < tesseract-ocr@googlegroups.com> wrote: > Hi every one. I got

Re: [tesseract-ocr] I am unable to train a new font to tesseract, I am getting a deserialize failed error

2023-11-23 Thread Zdenko Podobny
Please provide files for replicating the problem, otherwise Zdenko št 23. 11. 2023 o 8:29 Adepu Sai Rahul napísal(a): > the tif files are not corrupted and box files are not of size zero > > > On Thursday, November 23, 2023 at 12:51:49 PM UTC+5:30 desal...@gmail.com > wrote: > >> Make

Re: [tesseract-ocr] Re: Training from Scratch

2023-11-23 Thread Zdenko Podobny
št 23. 11. 2023 o 10:28 Des Bw napísal(a): > If the original model lacks the ∠ symbol, fine tuning is not going to add > it for you. Really??? Tesseract documentation

Re: [tesseract-ocr] Troubling with reading text from image

2023-11-19 Thread Zdenko Podobny
Captcha was created to fool OCR, so Tesseract output is as expected ;-) Zdenko ne 19. 11. 2023 o 19:15 Исмаилов Ориф napísal(a): > Hi, i have images where i should read text and numbers, but i am having > trouble with this > [image: Снимок экрана 2023-11-19 230906.png] > here is what

Re: [tesseract-ocr] Dictionary?

2023-11-19 Thread Zdenko Podobny
AFAIR there were tests with the legacy engine where the improvement in result quality from dictionaries was measured as 10-15% for common text. However, adding a word to a dictionary has never ensured Tesseract's accurate recognition of that word. For non-word inputs (e.g. serial numbers

Re: [tesseract-ocr] DLL runtime issues with API on Windows

2023-11-11 Thread Zdenko Podobny
Please provide full information to replicate the problem (exact code, how you compiled it...) Zdenko Sat 11. 11. 2023 at 15:20 Anthony Vallone wrote: > Hello, > > I am using MSYS2 to install tesseract on Windows, following the installation > instructions

Re: [tesseract-ocr] LSTM-based training produces .box files with the same coordinates

2023-11-01 Thread Zdenko Podobny
Are you following official tutorials? Did you read the documentation? Have you tried to check the official training repository and provided examples? Zdenko st 1. 11. 2023 o 10:15 TRAN TRONG KHANH[학생](대학원 컴퓨터공학과) ‍ < khanhtran...@khu.ac.kr> napísal(a): > Hi all, > > I tried to run an example

Re: [tesseract-ocr] Getting Error: No such file or directory: 'data/foo/all-lstmf'

2023-10-28 Thread Zdenko Podobny
It does not work on Windows (directly) but it works on Linux => use WSL if you really need training. Or wait until somebody finds a fix for Windows (or send the fix yourself - this is an open source project, so everybody should contribute ;-) ) Zdenko Fri 27. 10. 2023 at 17:32 Dev Solution wrote: > >

Re: [tesseract-ocr] OCR Output contains "xlz"

2023-10-15 Thread Zdenko Podobny
Seems like you should put this question to the author of the language data "ARYuanB5-MD"... Zdenko Sun 15. 10. 2023 at 15:44 'Danny Wilson' via tesseract-ocr < tesseract-ocr@googlegroups.com> wrote: > Running tesseract on a single Chinese character "對" outputs the character, > but also the text

Re: [tesseract-ocr] "Leptonica was build without TIFF support! Disabling TIFF support..."

2023-10-15 Thread Zdenko Podobny
Honestly, this is a very messy configuration for me. Why? Tesseract (and other projects) use CMake to avoid such manual settings. Just follow the example in our GitHub action for cmake[1] - it is simply stupid and it works. Cmake takes care of correct linking (debug/release), and build (no need

Re: [tesseract-ocr] Deserialize Header Failed

2023-10-14 Thread Zdenko Podobny
Hello, tesseract works out of the box. What does not work is users downloading Tesseract at night and jumping straight into Tesseract training. Training requires knowledge and experience that you will not get by following some random internet tutorials (most of them are outdated, pretending to be

Re: [tesseract-ocr] "Leptonica was build without TIFF support! Disabling TIFF support..."

2023-10-09 Thread Zdenko Podobny
Please provide full logs including installation, configure parameters etc. - not screenshots. Make sure you have only one installation of the leptonica library. Make your own test of whether leptonica is built with TIFF. Use the release target and not debug. Zdenko Sun 8. 10. 2023 at 21:56 DJuego Director De

Re: [tesseract-ocr] Multiple colours text in an image

2023-10-07 Thread Zdenko Podobny
Hello, this is about image preprocessing/thresholding rather than tesseract... Please post an example image so tesseract users can test it and suggest a possible solution. Zdenko št 21. 9. 2023 o 13:04 Iago Giné napísal(a): > Hi all, > > Is there some option to tell tesseract-ocr that there

Re: [tesseract-ocr] quality of recognition of customer invoices

2023-09-22 Thread Zdenko Podobny
I know there are (were) people at the forum that implemented Tesseract as part of invoice processing - but as a commercial solution. It is not as easy as it looks: there is a need for a custom solution for text detection (e.g. skipping logos and other graphics, possible handwriting). As far as I

Re: [tesseract-ocr] how to manual install tesseract-ocr all code include third library code build without cmake in windows

2023-09-21 Thread Zdenko Podobny
Why do you want to compile tesseract? Zdenko Thu 21. 9. 2023 at 15:26 Phoenix Tree wrote: > i am noob. > > some limit in my windows machine , > I can't have network, I must manual download tesseract-ocr all code > include third library code > can't use cmake > but can write python script >

Re: [tesseract-ocr] Tesseract Custom Model Not Recognized after Training

2023-09-18 Thread Zdenko Podobny
Unfortunately you hid all important information (e.g. how did you run training? how did you run tesseract, including tesseract options, exact command or code?), so just some hints: > Error: LSTM requested, but not present!! This implies that the requested traineddata file does not contain

Re: [tesseract-ocr] Strange behaviour of Tesseract

2023-09-14 Thread Zdenko Podobny
> > Is it still broken in version 5? The thread you posted is from 2017! [image: image.png] Zdenko št 14. 9. 2023 o 17:10 Gilad Pellaeon napísal(a): > Is it still broken in version 5? The thread you posted is from 2017! > > One thing I noticed in the meantime: I stored my PNGs with

Re: [tesseract-ocr] Strange behaviour of Tesseract

2023-09-14 Thread Zdenko Podobny
https://github.com/tesseract-ocr/tesseract/issues/845 Zdenko št 14. 9. 2023 o 16:49 Gilad Pellaeon napísal(a): > Hi, > > I am new to Tesseract. I searched for an OCR library, found Tesseract and > now I want to use it for a specific measure protocol. > > I built Tesseract 5.3.2 from source

Re: [tesseract-ocr] Normalization failed for string

2023-09-14 Thread Zdenko Podobny
unicharset is created automatically (by official training procedure https://github.com/tesseract-ocr/tesstrain) Zdenko št 14. 9. 2023 o 13:56 Ali hussain napísal(a): > I have faced in my own trianed_text this normalization error. I think the > main problem is * ্য*in these words. and i

Re: [tesseract-ocr] Preprocess screenshot image before tesseract.

2023-08-29 Thread Zdenko Podobny
Please do not send compressed images (rar, zip) to the mailing list. Post them somewhere or use an appropriate image format to decrease their size (renaming a bmp file to png does not work). Zdenko Tue 29. 8. 2023 at 9:01 Ajay Pandya wrote: > Hello Everyone, > > Can anyone help me with the

Re: [tesseract-ocr] Whitelist is not accepting special characters

2023-08-27 Thread Zdenko Podobny
IMO there is no need to use psm and whitelist: tesseract text.png - -l fast/script/Latin Estimating resolution as 274 Ñato ñelo ñaña álca moño Ñoko niño niña chillňa élif For Windows I guess there could be a problem with UTF-8 in the terminal... Zdenko Sun 27. 8. 2023 at 6:25 Shadya S.

Re: [tesseract-ocr] Suggestions for Windows 10 x64 build issue

2023-08-20 Thread Zdenko Podobny
Maybe you should provide a simple test case for replicating the problem (including information on how you built tesseract). E.g. SetRectangle_test.cpp (from https://groups.google.com/g/tesseract-ocr/c/PMHq6YSpRRE/m/Z2DCrgQlAAAJ) links without problems for me: cl /EHsc SetRectangle_test.cpp

Re: [tesseract-ocr] Question reg. Telugu ; char missing in ocr ; how to fix ?

2023-08-17 Thread Zdenko Podobny
Please provide details of what you are doing (including the Tesseract version, OS, and which tessdata you used...). Make sure you read the tesseract documentation, and please also provide details on which suggested solution you used and which char is missing (as not everybody is familiar with

Re: [tesseract-ocr] only english language is recoganising

2023-08-17 Thread Zdenko Podobny
We are sorry, but we have no clue what you are doing. Please provide the details for replicating your problem. Zdenko Sat 12. 8. 2023 at 20:25 V S KARTHIK wrote: > Hi, > malaylam or any other language is not extracting from image why?anybody > knows? > > -- > You received this message

Re: [tesseract-ocr] SetRectangle change?

2023-08-01 Thread Zdenko Podobny
Yes, there is a problem with SetRectangle or there is a mismatch between other API functions (e.g. GetThresholdedImage). It could be demonstrated with the attached simple code. According to API [1] SetRectangle(left, *top*, width, height) e.g. SetRectangle(left, top, width, height *.3) should
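The expected behaviour of the height * .3 example can be written down as a small calculation. This is only a sketch of the arithmetic, using the (left, top, width, height) argument order quoted above; top_fraction_rect is a hypothetical helper, and the resulting tuple would still have to be passed to SetRectangle through whichever API or wrapper you use.

```python
def top_fraction_rect(left, top, width, height, fraction):
    """Rectangle covering the top `fraction` of a region.

    Returns (left, top, width, height) in the same order as SetRectangle.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in the range (0, 1]")
    # round() avoids off-by-one results from floating point products.
    return (left, top, width, round(height * fraction))


if __name__ == "__main__":
    # Top 30% of a 100x200 region starting at the image origin.
    print(top_fraction_rect(0, 0, 100, 200, 0.3))
```

If the recognized text or the thresholded image does not correspond to this rectangle, that points to the mismatch described in the post.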

Re: [tesseract-ocr] Tesseract-ocr in quiet mode

2023-07-23 Thread Zdenko Podobny
It is not a tesseract problem but a VB one. Proof of this can be found in pytesseract, which calls the tesseract executable without a console window. Zdenko Sun 23. 7. 2023 at 15:55 nor s wrote: > Is there a way to have Tesseract run without producing a Dos window? I'm > incorporating a call to

Re: [tesseract-ocr] missing tesseract_opencl_profile_devices.dat (or how to disable OpenCL)

2023-07-16 Thread Zdenko Podobny
It is incompetent and irresponsible to use an experimental code in production/distribution. Zdenko ne 16. 7. 2023 o 21:13 Markus Leuthold napísal(a): > It looks like OpenSuse TW builds the package with "--enable-opencl" > >

Re: [tesseract-ocr] missing tesseract_opencl_profile_devices.dat (or how to disable OpenCL)

2023-07-16 Thread Zdenko Podobny
There is no possibility to disable OpenCL at run time. OpenCL is disabled by default and marked as experimental, not suggested by the forum/issue tracker, etc. It is there (as compile option) only as a startup point for possible developers. Zdenko ne 16. 7. 2023 o 17:21 Markus Leuthold

Re: [tesseract-ocr] tesseract runs but gives no output

2023-07-15 Thread Zdenko Podobny
tesseract d:\temp\temp\Screenshot_20230601_102638.jpg -l eng+hin 1>>c:\temp\temp2.txt is not the correct command. Did you mean: tesseract d:\temp\temp\Screenshot_20230601_102638.jpg output -l eng+hin 1>>c:\temp\temp2.txt please consult tesseract --help Zdenko pi 14. 7. 2023 o 14:19 Ales
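The missing piece in the first command is the output base, which must come right after the image name and before the options. A minimal sketch of building such a command is shown below; ocr_command is a hypothetical helper, the file name is an example, and nothing is executed here.

```python
def ocr_command(image, output_base="-", langs=("eng",)):
    """Build a tesseract call with the output base in its required position.

    output_base "-" sends text to stdout; any other value, e.g. "output",
    makes tesseract write output.txt. Several languages are joined with "+".
    """
    if not langs:
        raise ValueError("at least one language is required")
    return ["tesseract", image, output_base, "-l", "+".join(langs)]


if __name__ == "__main__":
    print(ocr_command("Screenshot.jpg", "output", ("eng", "hin")))
```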

Re: [tesseract-ocr] OCR inconistencies

2023-07-13 Thread Zdenko Podobny
Hello, I am not sure what you meant by "Redact 5.3.1", but please provide a test case to reproduce the problem. For me tesseract works: tesseract incon.png - --- Hidden text -- Zdenko Wed 12. 7. 2023 at 16:37 Jamiel Impoy wrote: > Hello, > > For Redact 5.3.1, there is a strange edge

Re: [tesseract-ocr] Any ways to further improve OCR results

2023-07-08 Thread Zdenko Podobny
I am not sure what you mean by "I have tried setting the Region of Interest (ROI) ", but when I cut region and pre-processed it as described in the documentation I got the correct results: tesseract frame_1-ROI1_preprocessed.png - --psm 7 GOH SCE YUAN tesseract frame_1-ROI2_preprocessed.png -

Re: [tesseract-ocr] Help Using tesstrain for machine generated display

2023-07-08 Thread Zdenko Podobny
Have a look at https://github.com/tesseract-ocr/tesseract/issues/2342 and search for "tesseract OCR dot matrix"; there are several suggestions on how to improve OCR results, e.g. https://jeffreymorgan.io/articles/improve-dot-matrix-ocr-performance-tutorial/ PS: it does not make sense to post custom

Re: [tesseract-ocr] libtesseract skip OCR, just create invisible text layer

2023-07-08 Thread Zdenko Podobny
No, it is not possible (tesseract uses an image used for OCR for pdf creation, OCR output for the position of text...) Zdenko st 5. 7. 2023 o 7:12 lbr napísal(a): > I'm trying to create a searchable pdf out of a scanned one. I want to use > Textract as an OCR engine instead of Tesseract. Is

Re: [tesseract-ocr] START_MODEL gives Segmentation Failure Error

2023-07-08 Thread Zdenko Podobny
Let's start with the basics: The current leptonica version is 1.83.1 https://github.com/DanBloomberg/leptonica/releases The current tesseract version is 5.3.1 https://github.com/tesseract-ocr/tesseract/releases Use the latest version if there is a problem. Nobody wants to waste time with

Re: [tesseract-ocr] Unable to get conversion from colorful odia pdf

2023-07-08 Thread Zdenko Podobny
If you are interested in getting help, please provide a description/images of what you are doing/using. Zdenko Wed 5. 7. 2023 at 7:17 Sailesh Agrawal wrote: > Hi, this is Sailesh > I have been trying to use tesseract for extracting oriya test from pdf, it > is working fine will black and white

Re: [tesseract-ocr] 3 signs read -> result is an extra symbol?

2023-07-08 Thread Zdenko Podobny
It is not a bug. Use a better text editor (that supports utf-8). Zdenko pi 7. 7. 2023 o 8:15 z20leh napísal(a): > Hallo, > i use ubuntu-22.04.2-desktop-amd64 > tesseract 4.1.1 leptonica-1.82.0 > > i have small picutres with only 3 numbers. > For example: > 1,02 > in the outputfile of

Re: [tesseract-ocr] OCRmyPDF and Tesseract not making PDFs searchable

2023-07-03 Thread Zdenko Podobny
1. Also provide example files (input, output). 2. Tesseract does not accept pdf (it needs an image as input), so at least 3. seems to be a problem of OCRmyPDF. Also provide the output of the "tesseract --version" command. Zdenko On Mon, 3 Jul 2023 at 21:24, Filippos Koliopanos wrote: > > Hello, > >
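
Since Tesseract needs an image as input, it can help to check the input type before calling it. The sketch below, with the hypothetical helper name `is_pdf`, only detects a PDF by its leading magic bytes; converting the PDF pages to images is left to a separate tool and is not shown:

```python
def is_pdf(path) -> bool:
    """Return True if the file starts with the PDF magic bytes '%PDF-'."""
    with open(path, "rb") as handle:
        return handle.read(5) == b"%PDF-"


def check_ocr_input(path) -> str:
    """Describe whether a file can be passed to tesseract directly."""
    if is_pdf(path):
        return "PDF: convert the pages to images first"
    return "not a PDF: may be passed to tesseract if it is a supported image"
```

The check looks at the file content, not the extension, so a PDF saved under an image file name is still detected.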

Re: [tesseract-ocr] Any ways to further improve OCR results

2023-06-27 Thread Zdenko Podobny
Without an example image, nobody can help you. Zdenko On Tue, 27 Jun 2023 at 12:01, Lee Kar Yee wrote: > Hi all, > > I am new to Tesseract OCR. I am trying to achieve extracting alphabets and > numbers from images. > These images are being converted from a mp4 video into frames as JPG. > > While

Re: [tesseract-ocr] Failed to load list of training filenames from data/Chin/list.train

2023-06-24 Thread Zdenko Podobny
Please provide the full training log, including how you installed tesseract, the training tools, etc. Zdenko On Thu, 22 Jun 2023 at 11:15, abhilash rao wrote: > Hi guys so i am trying to train tesseract using wsl and when i execute the > following training command TESSDATA_PREFIX=.../tessdata

Re: [tesseract-ocr] START_MODEL gives Segmentation Failure Error

2023-06-24 Thread Zdenko Podobny
Hello, If you are really looking for help, you need to provide full details (e.g. the whole training log, how you installed tesseract, which version of tesseract, how you installed the model (especially the hin model), an example of training data that helps to replicate the "Segmentation Failure", etc.

Re: [tesseract-ocr] Regarding facing issue in tesseract download

2023-06-22 Thread Zdenko Podobny
Hello, If you are serious about getting help, provide details (we have no clue what your system is, we do not know how you try to extract an image from pdf, etc...). Make sure you read the tesseract documentation first. Zdenko On Thu, 22 Jun 2023 at 14:48, Aniket Kumar wrote: > In my system When I

Re: [tesseract-ocr] Original training data for eng.traineddata

2023-06-20 Thread Zdenko Podobny
With open-source data you will not be able to create (from scratch) traineddata of the same quality as Google provided. However, there are some projects that fine-tuned the Google model successfully, e.g. (UB-Mannheim/: https://madoc.bib.uni-mannheim.de/53748/ ) Zdenko On Wed, 21 Jun 2023 at 4:38, Duy Khanh

Re: [tesseract-ocr] Runic OCR with tesseract

2023-06-20 Thread Zdenko Podobny
https://github.com/tesseract-ocr/langdata and https://github.com/tesseract-ocr/langdata_lstm provide input data that could be useful for tesseract training. I am not aware of Runic traineddata released by Google or contributors => you will need to create it yourself. Zdenko On Tue, 20 Jun 2023 at

Re: [tesseract-ocr] Unable to generate Hindi line images using text2image

2023-06-20 Thread Zdenko Podobny
Please follow the official training procedure [1], read the official docs [2], or complain to the author of the tutorial you decided to follow. [1] https://github.com/tesseract-ocr/tesstrain [2] https://tesseract-ocr.github.io/tessdoc/ Zdenko On Tue, 20 Jun 2023 at 10:39, abhilash rao wrote: >

Re: [tesseract-ocr] Building for iOS arm-64 produces x86_64 library

2023-06-20 Thread Zdenko Podobny
Please do not post only the last error - usually, there is a problem earlier, and e.g. the configure output could indicate a lot of... Make sure you check the issue tracker, where there are already some hints on what to check, e.g. https://github.com/tesseract-ocr/tesseract/issues/3980

Re: [tesseract-ocr] Building for iOS arm-64 produces x86_64 library

2023-06-18 Thread Zdenko Podobny
Hello, I am not a Mac user, but the following output indicates that autotools is not able to use g++ for arm-64: checking for arm-apple-darwin64-g++... no checking for arm-apple-darwin64-clang++... no Also, you try to force linking LIBS="-lz -lpng -ljpeg -ltiff", but configure claims tiffio.h

Re: [tesseract-ocr] What is the lstm.train file used for?

2023-06-16 Thread Zdenko Podobny
It is used for training. Zdenko On Wed, 14 Jun 2023 at 11:31, Duy Khanh wrote: > In the "Makefile" file of tesstrain, there are parts where the following > command is executed: > ``` > tesseract --psm 13 lstm.train > ``` > > Why does it run tesseract with the lstm.train file? If I am running

Re: [tesseract-ocr] Segmentation fault with `tesseract -v`

2023-06-16 Thread Zdenko Podobny
How did you build tesseract? What platform did you use? What compiler? etc. Please share the details, otherwise you are on your own with your problems... Does it crash when you run it from the command line? Zdenko On Fri, 16 Jun 2023 at 6:20, Abhishek Chaudhary wrote: > Hi, I'm building tesseract

Re: [tesseract-ocr] unicharset is not returning anything

2023-06-11 Thread Zdenko Podobny
Hello, 1. Version 4.x is old, outdated, and unsupported. Use the current tesseract version (5.3.1) 2. Which official training procedure do you follow? 3. Do you intentionally try to train the legacy engine (I assume based on your box file)? BTW: Legacy training was broken and it is

Re: [tesseract-ocr] Getting Error: No such file or directory: 'data/foo/all-lstmf'

2023-06-06 Thread Zdenko Podobny
Do not create files manually. If "make training" does not work, it means: 1. you are missing some dependency or the input data is wrong; 2. you also missed the error message for 1. I strongly suggest that you start the training from the beginning (including cloning tesstrain) and pay attention to all messages:
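
One way to rule out wrong input data before running "make training" again is to check that every line image has a transcription next to it. The sketch below assumes the tesstrain convention of a `.gt.txt` file with the same base name as each image in the ground-truth directory (for the model name in this thread that would be `data/foo-ground-truth`). The function name `missing_transcriptions` is hypothetical, and the sketch handles only plain `.png` and `.tif` files:

```python
from pathlib import Path

# Image suffixes this sketch checks; an assumption, not the full tesstrain list.
IMAGE_SUFFIXES = {".png", ".tif"}


def missing_transcriptions(ground_truth_dir):
    """Return the names of images that have no matching .gt.txt file."""
    root = Path(ground_truth_dir)
    missing = []
    for image in sorted(root.iterdir()):
        if image.suffix.lower() in IMAGE_SUFFIXES:
            # line1.png is expected to be paired with line1.gt.txt
            if not image.with_suffix(".gt.txt").exists():
                missing.append(image.name)
    return missing
```

An empty result does not guarantee that training will succeed; it only removes one common cause of a missing `all-lstmf` file list.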

Re: [tesseract-ocr] Need help, how to recognize numbers from this image?

2023-06-05 Thread Zdenko Podobny
Follow the suggestions at https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md Zdenko On Mon, 5 Jun 2023 at 18:42, ChrisL wrote: > I have tried with psm 1 to psm 13, does not work well. > > Thanks

Re: [tesseract-ocr] Where to find documentation on config files and parameters?

2023-06-05 Thread Zdenko Podobny
Funny, but when I open your link ( github ) I see there: , STRING_MEMBER(tessedit_char_blacklist, "", "Blacklist of chars not to recognize", this->params()) BTW: did you try to run
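
Parameters such as `tessedit_char_blacklist` can be set on the command line with `-c NAME=VALUE`, and `tesseract --print-parameters` lists the available parameters with their descriptions. A minimal Python sketch that assembles such a command line; the helper name `build_tesseract_command` and the file names in the example are hypothetical:

```python
def build_tesseract_command(image, output_base, params=None, psm=None):
    """Build an argument list for the tesseract command-line tool."""
    command = ["tesseract", str(image), str(output_base)]
    if psm is not None:
        command += ["--psm", str(psm)]
    # Each parameter becomes its own "-c NAME=VALUE" pair.
    for name, value in (params or {}).items():
        command += ["-c", f"{name}={value}"]
    return command


# Example: blacklist the pipe character while reading a single text line.
print(build_tesseract_command("input.png", "output",
                              {"tessedit_char_blacklist": "|"}, psm=7))
```

The resulting list can be passed to `subprocess.run` without a shell, which avoids quoting problems with special characters in the parameter values.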

Re: [tesseract-ocr] Help in Training Tesseract5 using purely windows OS and python

2023-06-05 Thread Zdenko Podobny
Hello, 1. If you are a newbie to tesseract 5, one of the worst things to do is to start by training tesseract. 2. I am not sure what you mean by "I have read the documentation available on github(readme.md) " - the tesseract documentation is here: https://tesseract-ocr.github.io/tessdoc/
