Re: [tesseract-ocr] URL support in 4.1.1

2021-07-02 Thread Zdenko Podobny
It is the packager's/builder's decision which features are enabled. If you do not like it, compile it yourself. It is free and open-source code. On Thu, 1 Jul 2021, 17:54 Ola Nowak, wrote: > Thanks for the answer but I don't really understand how is it possible. > Why there are incomplete

Re: [tesseract-ocr] URL support in 4.1.1

2021-06-30 Thread Zdenko Podobny
Your tesseract build has no support for the curl library => no support for URLs... https://github.com/tesseract-ocr/tesseract/blob/75e6c3ea4c8eae740fb65a84e77dbf0c8d092240/src/api/baseapi.cpp#L1148-L1182 "Correct" output would be like this: tesseract 5.0.0-alpha-20201231-536-gd5fb7 leptonica-1.81.0
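A rough way to check this from a script is to look at the version banner. The sketch below assumes that a build with curl support lists a "libcurl/..." entry in the output of "tesseract --version"; the helper names are illustrative and not part of tesseract.

```python
import subprocess


def has_libcurl(version_text: str) -> bool:
    """Return True if a tesseract version banner mentions libcurl.

    Assumption: builds with curl support print a "libcurl/<version>" entry.
    """
    return "libcurl" in version_text.lower()


def tesseract_version_text(exe: str = "tesseract") -> str:
    """Run `<exe> --version` and return everything it printed."""
    result = subprocess.run(
        [exe, "--version"], capture_output=True, text=True, check=False
    )
    # Depending on the version, the banner goes to stdout or stderr.
    return result.stdout + result.stderr


if __name__ == "__main__":
    banner = tesseract_version_text()
    print("URL input likely supported:", has_libcurl(banner))
```

If the check fails, the options are the ones given in the replies: use a package built with curl, or build tesseract yourself.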

Re: [tesseract-ocr] What is the proper format of the word list file for training tesseract?

2021-06-20 Thread Zdenko Podobny
See https://github.com/tesseract-ocr/langdata/tree/master/eng Zdenko On Sun, 20 Jun 2021 at 7:33, Sim Tov wrote: > > Hello, > > it is written in the documentation/Creating Starter Traineddata: > > >

Re: [tesseract-ocr] Tesseract does not recognise these numbers

2021-06-18 Thread Zdenko Podobny
With tessdata from [1] and oem 0 you can get: tesseract unnamed.png - --psm 7 --oem 0 09:41 Dm Otherwise: tesseract unnamed.png - --psm 7 0%:41 pm With a little preprocessing (blur and resize, so that letters are around 30 points high) you can get: tesseract time.png - --psm 7 09:41 pm [1]
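The preprocessing mentioned above could look roughly like the sketch below. It assumes the third-party Pillow package, treats the "30 points" as 30 pixels of letter height, and expects you to measure the current letter height yourself; the function names and the blur radius are illustrative, not the exact steps used in the reply.

```python
def scale_for_letter_height(measured_height: float, target_height: float = 30.0) -> float:
    """Factor by which to resize so letters end up about target_height pixels tall."""
    if measured_height <= 0:
        raise ValueError("measured_height must be positive")
    return target_height / measured_height


def scaled_size(width: int, height: int, factor: float) -> tuple:
    """New (width, height) after scaling, never smaller than 1x1."""
    return max(1, round(width * factor)), max(1, round(height * factor))


def preprocess(src: str, dst: str, measured_letter_height: float, blur_radius: float = 1.0) -> None:
    """Blur slightly, then resize so letters are roughly 30 pixels high."""
    # Pillow is imported lazily so the helpers above work without it.
    from PIL import Image, ImageFilter

    factor = scale_for_letter_height(measured_letter_height)
    with Image.open(src) as img:
        gray = img.convert("L")
        blurred = gray.filter(ImageFilter.GaussianBlur(blur_radius))
        resized = blurred.resize(scaled_size(*gray.size, factor), Image.LANCZOS)
        resized.save(dst)
```

For example, preprocess("unnamed.png", "time.png", measured_letter_height=12) would enlarge the image 2.5 times before you run tesseract with --psm 7.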

Re: [tesseract-ocr] Re: Tesseract gets space wrong

2021-06-04 Thread Zdenko Podobny
Search the issue tracker and forum for "table". Zdenko On Fri, 4 Jun 2021 at 17:13, Jeremy Young wrote: > It looks like there's a bug of some sort here. Attached is another image. > When I OCR it with > > "tesseract test.png test -c tessedit_create_hocr=1 -c hocr_char_boxes=1" > > the hocr for

Re: [tesseract-ocr] Stop OCRing

2021-06-02 Thread Zdenko Podobny
Have a look at the ETEXT_DESC *monitor (e.g. [1]). It is usually used for monitoring progress (e.g. [2]), but according to the header file it should also be usable for cancelling processing. [1]

Re: [tesseract-ocr] Help processing this tiny image with a date

2021-05-15 Thread Zdenko Podobny
> tesseract -v tesseract 5.0.0-alpha-20210401-66-g91b2b4 leptonica-1.81.0 (Apr 16 2021, 16:18:45) [MSC v.1928 LIB Release x64] libgif 5.2.1 : libjpeg 6b (libjpeg-turbo 2.0.91) : libpng 1.6.37 : libtiff 4.2.0 : zlib 1.2.11 : libwebp 1.2.0 : libopenjp2 2.4.0 Found AVX2 Found AVX Found FMA

Re: [tesseract-ocr] Can we use LSTM traineddata on legacy engine?

2021-05-03 Thread Zdenko Podobny
No. Zdenko On Mon, 3 May 2021 at 16:36, bulamadim bisey wrote: > Hello i'm new and i'm wondering if we did train language in tesseract 4.x > can we use on legacy engine like some applications doesnt accept my > traineddata > > -- > You received this message because you are subscribed to the

Re: [tesseract-ocr] Why output pdf of textonly_pdf=1 does not contain any data?

2021-04-30 Thread Zdenko Podobny
I am not sure what your problem is: the file is not empty and tesseract gave you exactly the output you asked for [1], [2]. It is not a tesseract issue or bug that you are not familiar with the command you used. [1]

Re: [tesseract-ocr] Tesseract ocr

2021-04-24 Thread Zdenko Podobny
ct feature to > extract the name,email address,amounts type of fields from documents. > > On Sat, Apr 24, 2021 at 2:50 PM Zdenko Podobny wrote: > >> Please be more specific: provide an example of what your input is and >> what you want to achieve. >> >> Zdenko

Re: [tesseract-ocr] Tesseract ocr

2021-04-24 Thread Zdenko Podobny
Please be more specific: provide an example of what your input is and what you want to achieve. Zdenko On Sat, 24 Apr 2021 at 7:58, Mohammad Waqas Shoukat Ali wrote: > hi team, > > i want to understand how i can teach my tesseract model for different > files format. > > -- > You received this

Re: [tesseract-ocr] Where to find the Tesseract.dll for Tesseract OCR version v5.0.0.

2021-04-23 Thread Zdenko Podobny
We are focused on providing source code and not binary packages, so most probably we will not provide it. IMO you do not need tesseract.dll, as by itself it would not help you. It would be better if you described what exactly you are trying to do/achieve. Building tesseract is no big deal (just a few

Re: [tesseract-ocr] Where to find the Tesseract.dll for Tesseract OCR version v5.0.0.

2021-04-22 Thread Zdenko Podobny
1. Version 5 is not released yet. 2. Did you try to read the documentation? Zdenko On Thu, 22 Apr 2021 at 3:03, Sharp Subbu wrote: > Dear Friends, > > We have tried to find the Tesseract.dll for Tesseract OCR version v5.0.0. > in the Tesseract git hub url

Re: [tesseract-ocr] Re: tessedit_create_boxfile condensed like boxaGetBox

2021-04-21 Thread Zdenko Podobny
Use TSV output, but you will still need to parse it to get line information. Zdenko On Wed, 21 Apr 2021 at 16:38, Baris Unsal wrote: > I want the opposite way. Getting ril_textline like output from passing > argument to tesseract. > > On Wednesday, 21 April 2021 at 17:36:35 UTC+3 Quan Nguyen
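Parsing the TSV output into text lines could be sketched as follows. This assumes the usual TSV column layout (level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf, text) with word rows at level 5; lines_from_tsv is an illustrative helper, not a tesseract API.

```python
import csv
import io


def lines_from_tsv(tsv_text: str) -> list:
    """Group word rows of tesseract TSV output into lines with a bounding box."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    lines = {}
    for row in reader:
        text = (row.get("text") or "").strip()
        # Only word rows (level 5) carry recognized text.
        if row.get("level") != "5" or not text:
            continue
        key = tuple(
            int(row[name])
            for name in ("page_num", "block_num", "par_num", "line_num")
        )
        left, top = int(row["left"]), int(row["top"])
        right, bottom = left + int(row["width"]), top + int(row["height"])
        line = lines.setdefault(
            key,
            {"words": [], "left": left, "top": top, "right": right, "bottom": bottom},
        )
        line["words"].append(text)
        # Grow the line box so that it covers every word on the line.
        line["left"] = min(line["left"], left)
        line["top"] = min(line["top"], top)
        line["right"] = max(line["right"], right)
        line["bottom"] = max(line["bottom"], bottom)
    return [
        {
            "text": " ".join(line["words"]),
            "left": line["left"],
            "top": line["top"],
            "right": line["right"],
            "bottom": line["bottom"],
        }
        for line in lines.values()
    ]
```

The TSV text itself can be produced with something like "tesseract image.png - tsv" and passed to the function as one string.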

Re: [tesseract-ocr] detect decimal point in amount with psm 11

2021-04-21 Thread Zdenko Podobny
1. You got the result for the image you provided. 2. I suggest you use another OEM. 3. I know that invoice digitization tools use different parameters for parsing numbers. Zdenko On Wed, 21 Apr 2021 at 17:45, Kumar Rajwani wrote: > Hi Zdenop, As i said i know psm 6 working better in

Re: [tesseract-ocr] detect decimal point in amount with psm 11

2021-04-21 Thread Zdenko Podobny
Try to use better config parameters, e.g.: $ tesseract download.png - --psm 6 --oem 0 will produce: $ 250,941.00 $ -75,282.00 $ 175,659.00 $ -15,072 00 $ 2,860.00 $ 0.00 $ 163,447.00 The legacy engine could be better for numbers. Zdenko On Wed, 21 Apr 2021 at 14:10, Kumar Rajwani wrote: > Hey, > I

Re: [tesseract-ocr] tessedit_create_boxfile condensed like boxaGetBox

2021-04-21 Thread Zdenko Podobny
Hello, it is unclear what you do/want to do: - you wrote that you want individual chars, but you request lines (RIL_TEXTLINE) from the API - then you wrote "Is there any way to combine individual boxes to print like API" so what do you want to combine? Maybe it would be better if you provide

Re: [tesseract-ocr] My data looks clean, why is it not recognised properly

2021-04-20 Thread Zdenko Podobny
Tesseract is an OCR engine, so try to eliminate graphical elements yourself/send only text areas to OCR. Zdenko On Tue, 20 Apr 2021 at 10:40, Soul Green wrote: > Omg thanks. > I hadn't thought about checking *that* documentation. I've been using > tesseract.js with node so I completely forgot

Re: [tesseract-ocr] tesseract v5.0 is not getting J, sometimes K

2021-04-20 Thread Zdenko Podobny
Why do you think that training will help? What does "TESSERACT 5.0 for WINDOWS" mean? Which version of the language data did you use? How did you preprocess the images? Zdenko On Mon, 19 Apr 2021 at 20:21, Filipe Benetti wrote: > Hello guys, > > Im trying to get these values from a .PNG and it's not

Re: [tesseract-ocr] How to reduce the size of a OCRed pdf file using Tesseract OCR APIs.

2021-04-14 Thread Zdenko Podobny
Tesseract is an OCR engine and it does not change the input image. For recompressing a PDF you need other tools, e.g. jbig2enc [1], mupdf [2]... [1] https://github.com/agl/jbig2enc [2] https://mupdf.com/docs/manual-mutool-convert.html Zdenko On Wed, 14 Apr 2021 at 15:26, Sharp Subbu wrote: > Dear

Re: [tesseract-ocr] OCR is not working for images with dark background and light letters

2021-04-02 Thread Zdenko Podobny
If you have a problem with textfairy, then report it to the author: https://github.com/renard314/textfairy He updates the app regularly (the last update was on 28 March 2021). Zdenko On Fri, 2 Apr 2021 at 10:34, two way wrote: > Thanks for your quick response. > The problem occurred while I am using textfairy

Re: [tesseract-ocr] pytesseract having high accuracy but performing very very slow

2021-03-25 Thread Zdenko Podobny
1 000 000 pages in one pdf? Seriously? + Post your code. pytesseract is not an efficient tool for multiple images (disk I/O for each run/page). Zdenko On Thu, 25 Mar 2021 at 8:49, Vidya Chitragar < vidya.chitra...@lucidatechnologies.com> wrote: > Hi Every one. > I am using pytesseract with

Re: [tesseract-ocr] extract paragraphs from the scanned pdf

2021-03-13 Thread Zdenko Podobny
If you need help, please provide an example of an input document, what you already did/the code you have, what the expected output is, etc. Otherwise forum users will just consider your post a statement and nobody will care. Zdenko On Sat, 13 Mar 2021 at 9:50, Ajeet Ojha wrote: > Hi All, I need to

Re: [tesseract-ocr] Cross compile Tesseract for arm64-v8a architecture

2021-01-26 Thread Zdenko Podobny
So you need to find out which header file from your aarch64-linux-android toolchain provides __get_cpuid(). Zdenko On Tue, 26 Jan 2021 at 7:48, Hussain Akbar wrote: > > Yes i have tried by commenting the #include . further errors > like __get_cpuid(1, , , , ) != 0) not found. Its obvious >

Re: [tesseract-ocr] Cross compile Tesseract for arm64-v8a architecture

2021-01-25 Thread Zdenko Podobny
Try commenting out the line with #include to see if there are other errors. PS: there is no problem building tesseract for ARM devices like the Raspberry Pi, so maybe there is something specific to Android... Zdenko On Mon, 25 Jan 2021 at 13:56, Hussain Akbar wrote: > > Please see this bug > >

Re: [tesseract-ocr] Removing colors

2021-01-07 Thread Zdenko Podobny
Unfortunately I am not aware of any (maintained) Python leptonica bindings (any volunteers?), but you can use leptonica directly via cffi in Python. See some examples: https://sk-spell.sk.cx/building-minimalistic-tesseract

Re: [tesseract-ocr] Removing colors

2021-01-06 Thread Zdenko Podobny
Try playing with the leptonica pixAutoPhotoinvert function [1]. A quick test with the following C code snippet produced the attached result: pix = leptonica.pixRead("des_resume3.png"); pix1 = leptonica.pixThresholdToBinary(pix, 170); autoinverted = pixAutoPhotoinvert(pix1, thresh, NULL, NULL);

Re: [tesseract-ocr] unexpected performance on cropped image

2021-01-06 Thread Zdenko Podobny
Did you try the suggestions from the documentation? On Wed, 6 Jan 2021, 14:27 zhenhao chen, wrote: > Hello > the performance on the image (attach below) is not good as expected > [image: 1.png] > [image: 2.png] > [image: 25.png] > I got windows10 and Tesseract '5.0.0-alpha.20201127' > is that a

Re: [tesseract-ocr] OCR only part of an scanned image

2021-01-05 Thread Zdenko Podobny
Can you share the original images, so people can "play"? Zdenko On Fri, 1 Jan 2021 at 21:17, Alex Santos wrote: > Hello > > I have an image like the one attached. My goal is to be able to select > only part of the scan for OCR. In my attachment, I marked up the areas in > red that I do not want to

Re: [tesseract-ocr] Tesseract works in debug, but fails in release

2020-12-31 Thread Zdenko Podobny
I remember the opposite situation (on Windows): debug was crashing while release was OK. I also remember some problems with static builds. Try to build tesseract yourself as described in the link above and use a shared library - it is not a big deal. Maybe the problem is related with

Re: [tesseract-ocr] Tesseract works in debug, but fails in release

2020-12-31 Thread Zdenko Podobny
I am not able to reproduce the problem - but I do not use vcpkg (so maybe the problem is there): 1. I used the official OpenCV for Windows https://netix.dl.sourceforge.net/project/opencvlibrary/4.5.1/opencv-4.5.1-vc14_vc15.exe -> Installed to F:\opencv2 2. Because of using opencv2 I prefer to use

Re: [tesseract-ocr] Tesseract v 5.0 on Linux

2020-12-31 Thread Zdenko Podobny
be released? > > > > Thank you > > > > *From:* tesseract-ocr@googlegroups.com *On > Behalf Of *Zdenko Podobny > *Sent:* Thursday, December 31, 2020 2:32 PM > *To:* tesseract-ocr@googlegroups.com > *Subject:* Re: [tesseract-ocr] Tesseract v 5.0 on Linux > > >

Re: [tesseract-ocr] Tesseract works in debug, but not release build

2020-12-31 Thread Zdenko Podobny
1. Do not post a code snippet - provide a full test case for easy replication. 2. Provide details: which OS, which compiler, how you built/got tesseract, which version, etc. Zdenko On Thu, 31 Dec 2020 at 20:53, Minseok Kim wrote: > std::string outText, imPath = "image.jpeg"; > cv::Mat im =

Re: [tesseract-ocr] Tesseract v 5.0 on Linux

2020-12-31 Thread Zdenko Podobny
Version 5 is not officially released and there are plenty of code changes (improvements) - the API is not ready/finalized. So the answer is: nowhere. You have to build it yourself. Zdenko On Thu, 31 Dec 2020 at 20:29, Peter Kronenberg wrote: > > Is there a way to get Tesseract 5.0 on Linux

Re: [tesseract-ocr] Tesseract not giving output

2020-11-18 Thread Zdenko Podobny
Try the docs: https://github.com/tesseract-ocr/tessdoc/blob/master/ImproveQuality.md Zdenko On Wed, 18 Nov 2020 at 16:36, agentbond009 wrote: > > I want to know the image requirements of tesseract like for some images it > doesn't give output. > like for example image attached here. > > It gives

Re: [tesseract-ocr] Tesseract remove space when I use LTSM mode

2020-11-03 Thread Zdenko Podobny
The tesseract "executable" (which is also an example of how to use the tesseract library) handles it correctly (for the LSTM and legacy engines). So check the source code. Zdenko On Tue, 3 Nov 2020 at 12:45, Enzo Merotto wrote: > I'm not sure because in TESSERACT_ONLY mode there are spaces, so it works. >

Re: [tesseract-ocr] Tesseract remove space when I use LTSM mode

2020-11-03 Thread Zdenko Podobny
IMO that is a problem in your code. Have a look at the tesseract code to see how to handle spaces. Here is the result for your image with different OEMs: > tesseract test_2020-11-03_122112048.png - --oem 0 -l fra En votre aimable règlement, Cordialement, > tesseract test_2020-11-03_122112048.png - --oem 1 -l fra En

Re: [tesseract-ocr] Re: Tesseract use cmake & Visual studio2019 build show something error.

2020-11-03 Thread Zdenko Podobny
You can ignore that message. Some internal functions use zlib, png and tiff but have no effect on OCR. Your question (regarding leptonica and image types) indicates you are not familiar with building software from source. In that case use the tesseract installer from Mannheim University

Re: [tesseract-ocr] Tesseract remove space when I use LTSM mode

2020-11-03 Thread Zdenko Podobny
Please provide a reproducible example of what you are doing, how, what the result is and what the desired result is. Zdenko On Tue, 3 Nov 2020 at 9:41, Enzo Merotto wrote: > Hello, > I have a problem with the ltsm mode because it do not detect space and > regroup every words in one. > Do you have an idea of

Re: [tesseract-ocr] Tesseract use cmake & Visual studio2019 build show something error.

2020-11-03 Thread Zdenko Podobny
As you see: you built leptonica without any external image library (like png, jpg, tiff), so tesseract can read only simple image formats like bmp, pgm and ppm. Zdenko On Tue, 3 Nov 2020 at 9:41, 吳明恩 wrote: > Environment > Tesseract Version: > 1.tesseract 4.1.1 > 2.leptonica-1.76.0 (Nov 3 2020,

Re: [tesseract-ocr] How to remove tesseract cleanly

2020-10-30 Thread Zdenko Podobny
"make uninstall" uninstalls tesseract perfectly if you installed from source. If you still have an issue, this means you had multiple installations of tesseract in your system (e.g. installed in /usr and /usr/local or by a user in a home directory...), or you did some unwise operation like installing 4x
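To see whether several installations are visible at once, you can list every matching executable on PATH. This is a minimal sketch using only the Python standard library; find_executables is an illustrative helper and it only covers PATH, not libraries or tessdata directories left behind.

```python
import os


def find_executables(name: str, path: str = None) -> list:
    """Return every distinct executable called `name` found on PATH, in PATH order."""
    if path is None:
        path = os.environ.get("PATH", "")
    # On Windows the executable normally carries an .exe suffix.
    names = [name, name + ".exe"] if os.name == "nt" else [name]
    found = []
    seen = set()
    for directory in path.split(os.pathsep):
        if not directory:
            continue
        for candidate in names:
            full = os.path.join(directory, candidate)
            if os.path.isfile(full) and os.access(full, os.X_OK):
                # Resolve symlinks so the same binary is not reported twice.
                real = os.path.realpath(full)
                if real not in seen:
                    seen.add(real)
                    found.append(full)
    return found


if __name__ == "__main__":
    for exe in find_executables("tesseract"):
        print(exe)
```

More than one line of output (for example one copy under /usr and one under /usr/local) points to the multiple-installation situation described above.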

Re: [tesseract-ocr] how to see which fonts are used in .traineddata files

2020-10-23 Thread Zdenko Podobny
e.g. https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.444.226=rep1=pdf https://arthurflor23.medium.com/text-segmentation-b32503ef2613 Zdenko On Fri, 23 Oct 2020 at 5:05, H Brenner wrote: > Hi Zdenko, > > Per you suggestion I have installed the latest version of tesseract (Ver > 5),

Re: [tesseract-ocr] Re: Disable image rotation on Tesseract

2020-10-20 Thread Zdenko Podobny
Try to provide the input image. Then maybe somebody can find a solution. Zdenko On Tue, 20 Oct 2020 at 14:57, Timo Laine wrote: > Would anyone have any ideas? > > - Timo > On Monday, 19 October 2020 at 8:39:03 UTC+3 Timo Laine wrote: > >> I'm convert TIF images to PDF files with Tesseract

Re: [tesseract-ocr] is there any way to extract underlined text from image?

2020-10-16 Thread Zdenko Podobny
With an example (e.g. an image) you have a higher chance of getting help. Zdenko On Thu, 15 Oct 2020 at 14:33, Mitesh Gabani wrote: > i want to extract underlined text from image using tesseract. > please suggest if there is any way to detect underlined text from image. > > -- > You received this message

Re: [tesseract-ocr] how to see which fonts are used in .traineddata files

2020-10-03 Thread Zdenko Podobny
1. Try the latest version. 2. Try playing with psm, e.g. tesseract 20201002.png - --psm 11 --dpi 300 produces: 8 27 26 10 04 03 01 N29 19 16 14 09 03 131 27 25 18 12 03 N21 18 16 13 07 04 N32 232112 10 07 N 36 34 30 27 21 01 X35 3417 13 10 08 N36 33 29 28 14 09 R 33 32 31 21 06 01 - oe

Re: [tesseract-ocr] OMP_THREAD_LIMIT=1 gives improvement in 4.1 version

2020-10-02 Thread Zdenko Podobny
this > contradiction, please comment on this? > > > > > > > > > >> >> > > >> Can you comment on following question >> If we didn't set up the OMP_THREAD_LIMIT, does it enable multithreading >> in Tesseract-4.1? >> >> >>

Re: [tesseract-ocr] Guidance for not recognized text

2020-10-02 Thread Zdenko Podobny
https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html The algorithm responsible for providing OCR results for "inverted images" is not reliable in tesseract >= 4 (or LSTM engine only?)... Zdenko On Thu, 1 Oct 2020 at 21:55, Jean-Marc Spaggiari wrote: > I was curious as why it works super

Re: [tesseract-ocr] OMP_THREAD_LIMIT=1 gives improvement in 4.1 version

2020-09-30 Thread Zdenko Podobny
1. OMP_THREAD_LIMIT is an environment variable, so it affects "everything", not only tesseract. 2. How did you measure performance? Provide details including OS, HW, etc. Zdenko On Wed, 30 Sep 2020 at 14:05, Sarath C P wrote: > OpenMP disabled in tesseract-ocr default. but when I am setting >
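Because the variable is inherited by every process started from the same environment, one way to limit its effect is to set it only for the tesseract subprocess. A minimal sketch, assuming tesseract is on PATH; the helper names are illustrative.

```python
import os
import subprocess


def env_with_thread_limit(limit: int, base: dict = None) -> dict:
    """Return a copy of the environment with OMP_THREAD_LIMIT set."""
    env = dict(os.environ if base is None else base)
    env["OMP_THREAD_LIMIT"] = str(limit)
    return env


def run_tesseract(image: str, limit: int = 1, exe: str = "tesseract") -> str:
    """OCR one image to stdout with the thread limit applied to this call only."""
    result = subprocess.run(
        [exe, image, "-"],
        env=env_with_thread_limit(limit),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout
```

The parent process and anything else it launches keep their original environment, which also makes timing comparisons between limits easier to reason about.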

Re: [tesseract-ocr] How do I only detect text of one size?

2020-09-25 Thread Zdenko Podobny
Maybe it would be good to provide some examples of input. Zdenko On Fri, 25 Sep 2020 at 7:57, Radu Stoicescu wrote: > I have some scanned, machine typed, that have a lot of noise. I can reduce > the noise, and I have done so. But there is some noise that is > statistically indistinguishable

Re: [tesseract-ocr] Re: Can not recognize white foreground texts.

2020-09-24 Thread Zdenko Podobny
https://github.com/tesseract-ocr/tessdoc/blob/master/ImproveQuality.md#inverting-images Zdenko On Thu, 24 Sep 2020 at 8:16, Rapht wrote: > for example. > > [image: 3.png] > On Thursday, 24 September 2020 at 08:27:06 UTC+3 Rapht wrote: > >> Hi, I do use my friend's

Re: [tesseract-ocr] Tesseract hardware minimun requirements

2020-09-24 Thread Zdenko Podobny
Your request is very unclear: if you are able to run OCR with tesseract, you already meet the requirements. ;-) There are people running tesseract on a Raspberry Pi. There are people using tesseract on mobile (e.g. https://play.google.com/store/apps/details?id=com.renard.ocr). Tesseract speed depends

Re: [tesseract-ocr] building tesseract for online hosting

2020-09-08 Thread Zdenko Podobny
-linux x64-mingw-static arm64-osx x86-mingw-dynamic x64-mingw-dynamic arm-mingw-dynamic arm64-windows-static Zdenko On Tue, 8 Sep 2020 at 14:13, Zdenko Podobny wrote: > I did not try it on linux, but you can try to use Microsoft vcpkg [1] to > build static leptonica [2]... (on w

Re: [tesseract-ocr] building tesseract for online hosting

2020-09-08 Thread Zdenko Podobny
. 2020 at 13:56, Zdenko Podobny wrote: > As I mentioned in a previous email: you need to build a static leptonica > library with all its dependencies (image libraries) as static libraries. > Maybe it would require more tweaking ;-) : > > > ldd /usr/bin/tesseract >

Re: [tesseract-ocr] building tesseract for online hosting

2020-09-08 Thread Zdenko Podobny
As I mentioned in a previous email: you need to build a static leptonica library with all its dependencies (image libraries) as static libraries. Maybe it would require more tweaking ;-) : > ldd /usr/bin/tesseract linux-vdso.so.1 (0x7ffdd44e2000) libtesseract.so.5 =>

Re: [tesseract-ocr] building tesseract for online hosting

2020-09-08 Thread Zdenko Podobny
Static building is a general topic and not tesseract specific. You need to consult the documentation for your build chain. E.g. a good starting point is ./configure --help if you use autotools. Linux (so maybe also Mac) prefers to build only shared versions of libraries, so you also need to rebuild all

Re: [tesseract-ocr] building tesseract for online hosting

2020-09-07 Thread Zdenko Podobny
Try to build a static version of tesseract. Zdenko On Mon, 7 Sep 2020 at 10:48, Mobeen Ali wrote: > Hi! > I was trying to build and upload tesseract online. I've somehow found the > tesseract but I'm getting this error: > > "pytesseract.pytesseract.TesseractError: (127, >

Re: [tesseract-ocr] Unable to tesseract text from cropped image

2020-08-18 Thread Zdenko Podobny
Follow the documentation: https://github.com/tesseract-ocr/tessdoc/blob/master/ImproveQuality.md E.g. you did not crop the image properly - there are still black borders, and there are graphical elements (signature). Also the last line with different fonts/size will fool OCR - you need to implement

Re: [tesseract-ocr] How to integrate tesseract to vs2019 mfc

2020-08-15 Thread Zdenko Podobny
Tesseract is a library like any other open-source library (zlib, png, opencv...), so you should integrate it into your project like any other library. If you are asking how to use this library, then I'm sure you already read the documentation, so I will not

Re: [tesseract-ocr] Unable to successfully process 'make leptonica tesseract'

2020-08-12 Thread Zdenko Podobny
Do not post screenshots of textual output. Provide full logs of the whole process. Zdenko On Wed, 12 Aug 2020 at 11:44, Sawan Kumar wrote: > Hello All, > > I cloned 'https://github.com/tesseract-ocr/tesstrain.git' at one of > directory in desktop. I installed all dependencies as mentioned in >

Re: [tesseract-ocr] Getting started with contributions

2020-08-06 Thread Zdenko Podobny
Just send pull requests to the GitHub repository. Zdenko On Thu, 6 Aug 2020 at 7:34, Uddeshya Tyagi wrote: > Hello developers! I'm Uddeshya Tyagi,a computer science student from > Jiit,Noida,India.I recently learnt basics of *tesseract* library.I,now > want to *contribute* to this project,so please

Re: [tesseract-ocr] How can I use tesseract library in Visual Studio?

2020-08-04 Thread Zdenko Podobny
Did you try to look at the documentation? Zdenko On Tue, 4 Aug 2020 at 20:22, Kirankumar Chincholi wrote: > Hello everyone, > I hope everyone is fine and safe, I am Kiran,I just tried some basic > openCV tutorial using Visual Studio 2019. Now, I need to extract text from > images by using tesseract

Re: [tesseract-ocr] Train for big letters in the beginning of the sentences(pic)

2020-08-04 Thread Zdenko Podobny
Not sure what you mean... tesseract big_low.jpeg - --psm 6 Warning: Invalid resolution 0 dpi. Using 70 instead. FY, MINERS.—TO LET, ON LEASE, on such terms as may be agreed on, the MINERALS in the ESTATE of KNOCKSHINNOCK, lying in the parish of New Cumnock, and county of Ayr. Acdead vein has

Re: [tesseract-ocr] Are character bboxes trustworthy?

2020-07-25 Thread Zdenko Podobny
As I mentioned, if you need good bounding boxes you have to use the legacy engine. There are several issues & comments explaining why it is a problem to get accurate bounding boxes, e.g. https://github.com/tesseract-ocr/tesseract/issues/2825#issuecomment-579220987 Zdenko On Sat, 25 Jul 2020 at 0:44

Re: [tesseract-ocr] Are character bboxes trustworthy?

2020-07-24 Thread Zdenko Podobny
Do you use the LSTM or the legacy engine? If LSTM: search the issue tracker/PRs/(forum?) for the bounding box problem (and Noah Metzger's patches). There are rumours that if you need really good bounding boxes you have to use the latest 3.5 version because changes in the 4.x version (and later) also affected

Re: [tesseract-ocr] tessaract ocr on capcha images--how to perform well?

2020-07-15 Thread Zdenko Podobny
There is absolutely no intention to help you (or others) to use OCR for breaking captchas. Zdenko On Tue, 14 Jul 2020 at 19:53, Omar Hasan wrote: > Hello! I am trying to run ocr on capcha images. well, for normal images > tessaract performs well, but for images below attachments, it performs

Re: [tesseract-ocr] Tesseract makes different predictions on seemingly equal images. How to make it more robust?

2020-07-14 Thread Zdenko Podobny
Try to use the latest version of tesseract. Zdenko On Tue, 14 Jul 2020 at 16:04, MysteriousGuy wrote: > I am using Tesseract to extract text from images attached. For some > reason, even though the images are nearly identical, tesseract makes a > mistake in one of them: for 'bad.png' the output

Re: [tesseract-ocr] Anyway to disable internal image preprocessing? (internal operations make really BAD result)

2020-07-03 Thread Zdenko Podobny
First of all: you do not mention any important information like which tesseract version you use, which language model etc. Next: "-c tessedit_write_image=1" produces Could not set option: tessedit_write_image=1 ;-) Next: If you want to avoid tesseract binarization (Otsu), you must provide really

Re: [tesseract-ocr] training the layout/segmentation/word detection engine

2020-07-01 Thread Zdenko Podobny
Try this: https://github.com/Sintun/PersonalHelperPrograms/blob/master/Tesseract/tess.cpp Longer story: https://github.com/tesseract-ocr/tesseract/issues/1714 Zdenko On Wed, 1 Jul 2020 at 10:29, amit...@gmail.com wrote: > I want to optimise tesseract 4 (lstm) for a set of documents I have. > I

Re: [tesseract-ocr] Optimize tesseract

2020-06-26 Thread Zdenko Podobny
There is no magic command/parameter that solves issues like this. And you did not provide enough information (e.g. what "python tesseract 4.0" is) to analyze whether you follow best practices. If you are really interested in help, you have to provide more information (e.g. HW specification,

Re: [tesseract-ocr] Why does tessaract fail on this image?

2020-06-12 Thread Zdenko Podobny
Search the forum/issue tracker - there is an explanation of why LSTM cannot provide exact character box coordinates. If you need exact character boxes, IMO you need to use the legacy engine (but it could have other problems). Zdenko On Fri, 12 Jun 2020 at 12:31, 'Tariq Ahmad' via tesseract-ocr <

Re: [tesseract-ocr] Why does tessaract fail on this image?

2020-06-11 Thread Zdenko Podobny
https://github.com/tesseract-ocr/tessdoc/blob/master/ImproveQuality.md#missing-borders Zdenko On Wed, 10 Jun 2020 at 18:50, 'Tariq Ahmad' via tesseract-ocr < tesseract-ocr@googlegroups.com> wrote: > I cannot understand why Tessaract fails on this (cropped) image: > > > Yet if i add a random

Re: [tesseract-ocr] Using libtesseract in Windows for screenshot OCR

2020-05-17 Thread Zdenko Podobny
On Sun, 17 May 2020 at 13:03, David Varns wrote: > > I am building some tools to extract text data from screenshots, which is a > simple (easy) case of OCR. (It is part of a platform for automated testing > of software, the tester interacts with UI elements and we want to be able > to read things

Re: [tesseract-ocr] Tessaract not able to output detected text

2020-04-29 Thread Zdenko Podobny
changing > different parameters. However I am still not able to get text from the > image. Attached my pre-processing code, which I am running before using > tesseract. But however I am unable to get text still. Please help. > > On Tue, 28 Apr 2020 at 23:57, Zdenko Podobny wrote: > >>

Re: [tesseract-ocr] Tessaract not able to output detected text

2020-04-28 Thread Zdenko Podobny
https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html Zdenko On Tue, 28 Apr 2020 at 20:26, payel roy wrote: > Hi Team, > > I am new to Tessaract. Following the code snippet. While running it, I > can't get result back from Tesseract on the detect texts. Please help. > > #!/usr/bin/python

Re: [tesseract-ocr] What is the working process of doing multiple images OCR using imagelist.txt

2020-04-17 Thread Zdenko Podobny
It loops over the file list [1], processing one filename at a time. [1] https://github.com/tesseract-ocr/tesseract/blob/cdebe13d81e2ad2a83be533886750f5491b25262/src/api/baseapi.cpp#L1007 Zdenko On Fri, 17 Apr 2020 at 12:42, mit wrote: > Hi, > > I want to know the internal memory working of tesseract
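In Python terms, the behaviour described above amounts to something like the sketch below: read one filename per line and handle the files one after another. This only mirrors the description, it is not a translation of the linked C++ code; parse_filelist, process_filelist and the process_page callback are illustrative names.

```python
from pathlib import Path


def parse_filelist(text: str) -> list:
    """Return the non-empty lines of a file list, one filename per line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def process_filelist(list_path: str, process_page) -> list:
    """Call process_page(filename) for each listed file, strictly in order."""
    text = Path(list_path).read_text(encoding="utf-8")
    results = []
    for filename in parse_filelist(text):
        # Only one image is handled at a time, so memory use does not
        # grow with the length of the list.
        results.append(process_page(filename))
    return results
```

With the command-line tool you get the same effect by passing the list file itself as the input, e.g. "tesseract imagelist.txt output".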

Re: [tesseract-ocr] 2 min on 1 page TIFF using Fast trained data

2020-04-15 Thread Zdenko Podobny
Just for future reference: for AVX (and ...) support only tesseract needs to be rebuilt - it depends on the compiler and HW. Of course it makes sense to use the latest version of tesseract dependencies (because of security, bugfixes etc.), but they have (AFAIK) minimal effect on tesseract speed

Re: [tesseract-ocr] 2 min on 1 page TIFF using Fast trained data

2020-04-14 Thread Zdenko Podobny
Without AVX support tesseract 4/5 will be slow(er). So try to focus on this. Using more than one lang will slow down OCR too... Zdenko On Tue, 14 Apr 2020 at 5:56, Ravil R wrote: > Oh you gave so much info, thanks! > My test exe file shows this version information: > tesseract 5.0.0 >
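Whether a given build detected AVX can be read from the "Found ..." entries of the version banner, like the "Found AVX2 Found AVX Found FMA" shown in the "tesseract -v" output elsewhere in this list. A small sketch for pulling those entries out of the text; simd_features and has_avx are illustrative helpers.

```python
import re


def simd_features(version_text: str) -> list:
    """Return the features reported as "Found <NAME>" in a version banner."""
    return re.findall(r"Found\s+(\w+)", version_text)


def has_avx(version_text: str) -> bool:
    """True if any reported feature is an AVX variant (AVX, AVX2, ...)."""
    return any(name.upper().startswith("AVX") for name in simd_features(version_text))
```

An empty feature list from a 4.x/5.x build is a hint that the binary was built without SIMD support or that the CPU lacks it.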

Re: [tesseract-ocr] 2 min on 1 page TIFF using Fast trained data

2020-04-13 Thread Zdenko Podobny
OS Name: Microsoft Windows 10 Pro OS Version: 10.0.18362 N/A Build 18362 System Model: Latitude E5570 System Type: x64-based PC Processor(s): 1 Processor(s) Installed. [01]: Intel64 Family 6 Model

Re: [tesseract-ocr] 2 min on 1 page TIFF using Fast trained data

2020-04-13 Thread Zdenko Podobny
Why did you decide to ignore the instructions in comment https://github.com/tesseract-ocr/tesseract/issues/2946#issuecomment-612613461 ? Why should we care about your problems if you do not? Zdenko On Sun, 12 Apr 2020 at 16:00, Ravil R wrote: > I have my own simple Windows dll based on

Re: [tesseract-ocr] How to split a3 in single page

2020-04-07 Thread Zdenko Podobny
No. Tesseract is an OCR engine and not an image processing tool. PDF export strictly follows the rule of not modifying the input image, i.e. if you have this need, you need to use other tools to create the PDF. Zdenko On Mon, 6 Apr 2020 at 23:51, Teo wrote: > I've this page, can I split this A3 scan in 2 A4, during the

Re: [tesseract-ocr] The text is not recognized from png

2020-04-07 Thread Zdenko Podobny
You can start by reading the docs and then searching the issue tracker and forum for "table". Zdenko On Tue, 7 Apr 2020 at 7:38, amrapalli karan wrote: > I have this .pdf file which I am able to read only partially. I am using R > language to fetch the data from the pdf file which is uploaded in the

Re: [tesseract-ocr] Can anyone tell the the improvement in 5.0.0-alpha

2020-04-02 Thread Zdenko Podobny
Just a quick reply: the master branch (a.k.a. 5.0.0-alpha) is the development branch, i.e. things there could be broken (e.g. build system or compatibility) ;-) The current stable branch/version is 4.1, where most patches from master were backported. If I remember correctly: differences between master and 4.1

Re: [tesseract-ocr] Re: Scan pdf file instead png

2020-03-28 Thread Zdenko Podobny
Tesseract OCRs images, not documents (pdf, docx, odt, etc.). If you need multipage support, use the TIFF image format instead of PDF for scanning. Zdenko On Sat, 28 Mar 2020 at 20:42, Essam Zaky wrote: > What do you mean by "scan a pdf " ? > If you mean recognize pdf file , you can not recognize pdf

Re: [tesseract-ocr] Best export method

2020-03-19 Thread Zdenko Podobny
Check out the hOCR output (which is HTML output), TSV or PDF. See the docs. Zdenko On Thu, 19 Mar 2020 at 8:04, Dayton wrote: > Hi All, > > I'm using Tesseract for Windows to OCR scanned documents and then format > the layout in Word in a later stage. > > The text extraction that I get in the .TXT

Re: [tesseract-ocr] Trying to build with OpenMP

2020-03-17 Thread Zdenko Podobny
Which version are you trying to build (master)? Did you search the issue tracker / forum for problems with OpenMP? (There is a reason why it is turned off by default.) Zdenko On Tue, 17 Mar 2020 at 10:15, Jerry Andersson wrote: > Hello, I am trying to build with openmp and sw on a windows 7 machine but > I

Re: [tesseract-ocr] Tesseract unable to read simple image correctly

2020-03-09 Thread Zdenko Podobny
Please tell us what you have already tried from the tesseract documentation. Zdenko On Mon, 9 Mar 2020 at 10:02, Velectico Consulting wrote: > *Environment* > Tesseract Version: tesseract v5.0.0-alpha.20200223 > Platform: Windows 64-bit > > *Problem: * > The attached image below is not read

Re: [tesseract-ocr] Supplying a different DPI param per page

2020-03-09 Thread Zdenko Podobny
Just a quick reply (I did not test it :-) ): - tiff is a "container of images" and AFAIK each image can have its own resolution (DPI is just information for correct printing/displaying of the image) - tesseract should read a multi-page tiff image by image and process each one individually

Re: [tesseract-ocr] WinError 5 PermissionError on Windows 10

2020-03-01 Thread Zdenko Podobny
Anyway, report it to the pytesseract project so it can be fixed; otherwise the next update will bring it back. Zdenko On Sun, 1 Mar 2020 at 18:17, Supharerk Thawillarp wrote: > After diving in pytesseract.py I found one possible related issue in > the NamedTemporaryFile. > > According to the

Re: [tesseract-ocr] WinError 5 PermissionError on Windows 10

2020-03-01 Thread Zdenko Podobny
Hello, I am not able to reproduce the error. The errors come from here [1], where pytesseract tries to clean up temporary files. You should report it to the pytesseract project, as there is no option to skip this code. Maybe you can try to modify this part of the pytesseract code [2]: finally: cleanup(f.name)
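The snippet is cut off before the suggested change, so the following is only a sketch of one way a temp-file cleanup step can tolerate a Windows-style `PermissionError`. `tolerant_cleanup` is a hypothetical helper written for illustration; it is neither pytesseract's own `cleanup` function nor the modification proposed in the message.

```python
import os
import tempfile
import time


def tolerant_cleanup(path: str, retries: int = 3, delay: float = 0.1) -> bool:
    """Try to delete `path`; return True if the file is gone, False if it stays locked."""
    for _ in range(retries):
        try:
            os.remove(path)
            return True
        except FileNotFoundError:
            # Already removed, nothing left to do.
            return True
        except PermissionError:
            # Another handle may still be open (common on Windows); wait and retry.
            time.sleep(delay)
    return not os.path.exists(path)


if __name__ == "__main__":
    handle = tempfile.NamedTemporaryFile(delete=False)
    handle.close()
    print(tolerant_cleanup(handle.name))
```

Returning a flag instead of raising lets the caller decide whether a leftover temp file is worth reporting.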

Re: [tesseract-ocr] WinError 5 PermissionError on Windows 10

2020-02-29 Thread Zdenko Podobny
1. Make sure you have the latest version of tesseract. Then try this script and provide the exact/full error message: import tempfile import cv2 import pytesseract from PIL import Image from pytesseract import Output pytesseract.pytesseract.tesseract_cmd = 'C:\\Program

Re: [tesseract-ocr] WinError 5 PermissionError on Windows 10

2020-02-29 Thread Zdenko Podobny
This means there is a problem with pytesseract/python permissions. Can you get the output of pytesseract.get_tesseract_version()? Zdenko On Sat, 29 Feb 2020 at 12:10, Supharerk Thawillarp wrote: > No, the tesserect successfully run with output generated in textfile. > > (base) PS

Re: [tesseract-ocr] WinError 5 PermissionError on Windows 10

2020-02-29 Thread Zdenko Podobny
Can you replicate the problem with command-line / "pure" tesseract? E.g. 'C:\\Program Files\\Tesseract-OCR\\tesseract.exe' images/invoice-sample.jpg invoice-sample Zdenko On Fri, 28 Feb 2020 at 20:31, Supharerk Thawillarp wrote: > > I'm new to tesseract and trying to follow tutorial on Windows 10
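To collect the exact error text from such a "pure" tesseract run, the command can be launched from Python without pytesseract in between. This is a minimal sketch; `run_and_report` is a hypothetical helper, and the executable path and file names are simply the ones quoted in the message.

```python
import subprocess
from typing import Dict, List, Union


def run_and_report(command: List[str]) -> Dict[str, Union[int, str]]:
    """Run a command and return its exit code plus everything it printed."""
    try:
        completed = subprocess.run(command, capture_output=True, text=True, check=False)
    except FileNotFoundError as exc:
        # The executable itself could not be started (wrong path or not installed).
        return {"returncode": -1, "stdout": "", "stderr": str(exc)}
    return {
        "returncode": completed.returncode,
        "stdout": completed.stdout,
        "stderr": completed.stderr,
    }


if __name__ == "__main__":
    report = run_and_report([
        r"C:\Program Files\Tesseract-OCR\tesseract.exe",
        "images/invoice-sample.jpg",
        "invoice-sample",
    ])
    print(report["returncode"])
    print(report["stderr"])
```

If this run succeeds while the pytesseract call fails, the problem lies in the Python wrapper or its file permissions rather than in tesseract.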

Re: [tesseract-ocr] Re: how to use tesseract to detect table?

2020-02-26 Thread Zdenko Podobny
The test image is https://miro.medium.com/max/2136/1*VdPb4yKCkz1RhXfafPw_rA.png ;-) Zdenko On Wed, 26 Feb 2020 at 08:03, Zdenko Podobny wrote: > Article points to this code on github: > https://github.com/huks0/tablerecognition/blob/master/celldetectextract.py > > > Zdenko > >

Re: [tesseract-ocr] Re: how to use tesseract to detect table?

2020-02-26 Thread Zdenko Podobny
Maybe have a look at https://github.com/tesseract-ocr/tesseract/issues/1714#issuecomment-588180969 (I have had no time to test it yet). Zdenko On Wed, 26 Feb 2020 at 11:02, KOLLOL CHOWDHURY wrote: > Does anyone have solution to this? In the newer tesseract(4.x), the > option

Re: [tesseract-ocr] Re: how to use tesseract to detect table?

2020-02-25 Thread Zdenko Podobny
The article points to this code on GitHub: https://github.com/huks0/tablerecognition/blob/master/celldetectextract.py Zdenko On Wed, 26 Feb 2020 at 07:41, Essam Zaky wrote: > would you download the article you described and attach it here , because > the medium site needs payed registration > > ‫في

Re: [tesseract-ocr] tesseract 3.3.0 always misinterpret few characters (desperate right now ...)

2020-02-20 Thread Zdenko Podobny
What is tesseract 3.3.0? I did not find it in the releases list (https://github.com/tesseract-ocr/tesseract/releases). Or did you mean the 3.03-rc1 release of Sep 20, 2014? Zdenko On Thu, 20 Feb 2020 at 14:25, Justin Yeh wrote: > Unfortunately tesseract 3.3.0 keeps misinterpreting characters such as B > and 8, or

Re: [tesseract-ocr] Re: Using tesseract on browser page insufficient

2020-02-20 Thread Zdenko Podobny
Why should we document how to use Ubuntu? You should be familiar with your OS. PPA repositories for each tesseract version are listed at https://tesseract-ocr.github.io/tessdoc/Home.html Zdenko On Thu, 20 Feb 2020 at 09:20, Alexander Dietz wrote: > With an update to version 4 (undocumented

Re: [tesseract-ocr] Tesseract OpenCL Selects Wrong Compute Device

2020-02-18 Thread Zdenko Podobny
Search the forum and issue tracker for the OpenCL topic. Zdenko On Wed, 19 Feb 2020 at 08:27, Tim Finnegan wrote: > I'm attempting to run GPU Acceleration during training using the OpenCL > libraries. > > I have built tesseract to use openCL, and installed the NVidia Compute > driver 440 on my Ubuntu

Re: [tesseract-ocr] Removing diagonal Text that intersect with the horizontal text I want to read.

2020-02-08 Thread Zdenko Podobny
Can you share the original pdf so we can investigate whether the problem could be solved at the pdf level (e.g. by extracting the image from the pdf without the watermark)? Your problem is not related to tesseract (or, put another way, tesseract is not a tool that helps you remove watermarks), so a better option would be to post it on

Re: [tesseract-ocr] Re: tesseract ocr to pdf from .tif file send from fax machine

2020-02-07 Thread Zdenko Podobny
You can build it yourself or wait until your packager builds it, or you can try the AppVeyor artifact [1] from the last commit. [1] https://ci.appveyor.com/project/zdenop/tesseract/build/job/66l95n7ofxrs0xtf/artifacts Zdenko On Fri, 7 Feb 2020 at 15:56, George Varghese wrote: > I found
