Re: [tesseract-ocr] Detecting language automatically

2021-03-25 Thread Charles Cho
Hi, >>> The OSD module does not detect language - it detects script, as you also >>> noted earlier: It detects language by using OSD in Tesseract, and Tesseract also provides the DetectOrientationScript function. api.Init("/Users/renard/devel/textfairy/tessdata", "osd",

Re: [tesseract-ocr] Detecting language automatically

2021-03-25 Thread Merlijn B.W. Wajer
Hi, On 25/03/2021 19:04, Charles Cho wrote: > Hi. > > Thank you very much for your kind help, shree. > I tried to detect the script with your help and it worked. Great. > > I have some questions. > 1. If the image contains text in different languages on one page, is there > any way to detect all of

Re: [tesseract-ocr] Detecting language automatically

2021-03-25 Thread Charles Cho
Hi. Thank you very much for your kind help, shree. I tried to detect the script with your help and it worked. Great. I have some questions. 1. If the image contains text in different languages on one page, is there any way to detect all of the languages? Currently it detects only one language. 2. It detects

Re: [tesseract-ocr] Detecting language automatically

2021-03-25 Thread shree
See https://github.com/tesseract-ocr/tessdoc/blob/master/examples/OSD_example.cc // Get OSD - new code int orient_deg; float orient_conf; const char* script_name; float script_conf; api->DetectOrientationScript(&orient_deg, &orient_conf, &script_name, &script_conf); printf("\n Orientation

Re: [tesseract-ocr] pytesseract having high accuracy but performing very very slow

2021-03-25 Thread Zdenko Podobny
1 000 000 pages in one PDF? Seriously? Also, post your code. pytesseract is not an efficient tool for multiple images (disk IO for each run/page). Zdenko On Thu, 25 Mar 2021 at 8:49, Vidya Chitragar < vidya.chitra...@lucidatechnologies.com> wrote: > Hi Every one. > I am using pytesseract with
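A minimal sketch of one way to reduce the wall-clock cost Zdenko describes, assuming the pages are already available as individual image files. It does not remove pytesseract's per-page process and disk overhead; it only overlaps that overhead across pages by running several calls at once. The function names `ocr_page_to_pdf` and `ocr_pages_parallel` and the default of 4 workers are hypothetical; `pytesseract.image_to_pdf_or_hocr` is the existing pytesseract call that returns searchable-PDF bytes for one image.

```python
from concurrent.futures import ThreadPoolExecutor


def ocr_page_to_pdf(image_path):
    """OCR a single page image and return the searchable-PDF bytes."""
    # Imported lazily so the rest of this sketch can be used without pytesseract.
    import pytesseract

    return pytesseract.image_to_pdf_or_hocr(image_path, extension="pdf")


def ocr_pages_parallel(image_paths, worker=ocr_page_to_pdf, max_workers=4):
    """Apply `worker` to every page concurrently, keeping the page order.

    Threads are enough here because each pytesseract call spends its time
    waiting on an external tesseract process.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, image_paths))
```

The returned list holds one PDF per page, in input order; merging them into a single document is left out of this sketch.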

Re: [tesseract-ocr] Detecting language automatically

2021-03-25 Thread Charles Cho
Hi, I have looked into detecting the language automatically. I referred to these links. Thank you, Merlijn. https://archive.org/services/docs/api/ocr.html#autonomous-mode https://git.archive.org/www/tesseract/-/blob/master/main.py#L757 So in my analysis, it used OSD of tesseract engine
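The idea discussed in this thread (detect the script with OSD, then choose a language model for the actual OCR pass) could be sketched roughly as follows. This is not the archive.org implementation. The `SCRIPT_TO_LANG` table is purely illustrative, the helper names are hypothetical, and the parser assumes the OSD text report contains lines of the form `Script: Latin` and `Script confidence: 3.33`. Since OSD reports a script rather than a language, any script-to-language mapping is a guess.

```python
import re

# Illustrative only: OSD yields a script, so each language here is an assumed default.
SCRIPT_TO_LANG = {
    "Latin": "eng",
    "Cyrillic": "rus",
    "Hangul": "kor",
    "Arabic": "ara",
}


def parse_osd(osd_text):
    """Return (script_name, script_confidence) from an OSD text report."""
    script = re.search(r"Script: (\S+)", osd_text)
    conf = re.search(r"Script confidence: ([\d.]+)", osd_text)
    if not script or not conf:
        raise ValueError("no script information found in OSD output")
    return script.group(1), float(conf.group(1))


def pick_language(osd_text, default="eng", min_conf=1.0):
    """Choose a language code for the detected script, or fall back to `default`."""
    script, conf = parse_osd(osd_text)
    if conf < min_conf:
        return default
    return SCRIPT_TO_LANG.get(script, default)


def ocr_with_detected_language(image):
    """Run OSD first, then OCR with the language chosen from the script."""
    import pytesseract  # lazy import; requires the osd traineddata to be installed

    osd_text = pytesseract.image_to_osd(image)
    return pytesseract.image_to_string(image, lang=pick_language(osd_text))
```

A page that mixes several scripts would still get a single answer from this sketch, which matches the limitation raised in the thread.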

Re: [tesseract-ocr] pytesseract having high accuracy but performing very very slow

2021-03-25 Thread Shree Devi Kumar
Try a newer version of tesseract. On Thu, Mar 25, 2021, 13:19 Vidya Chitragar < vidya.chitra...@lucidatechnologies.com> wrote: > Hi Every one. > I am using pytesseract with tesseract-ocr version 3.05.02 for conversion > of scanned pdf document of 1000k pages to searchable pdf document but my

Re: [tesseract-ocr] Pytesseract processing images already in memory

2021-03-25 Thread Lorenzo Bolzani
Try tesserocr, a real binding library. Bye Lorenzo On Thu, 25 Mar 2021 at 05:44, Alex Zetaeffesse wrote: > Hi all, > > I'm already using a python library (pyvips) for cropping images with text > inside. > Is there a way to have Pytesseract process images in memory without the
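A minimal sketch of what Lorenzo's suggestion might look like, assuming the cropped images have already been converted to PIL images (the pyvips-to-PIL conversion is not shown). `PyTessBaseAPI`, `SetImage` and `GetUTF8Text` are tesserocr's existing API; the helper names `ocr_in_memory` and `ocr_images` are hypothetical.

```python
def ocr_in_memory(images, api):
    """OCR already-loaded images with one engine instance, without temp files.

    `api` is expected to behave like tesserocr.PyTessBaseAPI:
    SetImage() takes a PIL image and GetUTF8Text() returns the recognized text.
    """
    texts = []
    for image in images:
        api.SetImage(image)
        texts.append(api.GetUTF8Text())
    return texts


def ocr_images(images, lang="eng"):
    """Create a single tesserocr engine and reuse it for every image."""
    from tesserocr import PyTessBaseAPI  # lazy import; needs tesserocr installed

    with PyTessBaseAPI(lang=lang) as api:
        return ocr_in_memory(images, api)
```

Reusing one engine instance avoids both the per-image temporary file and the per-image engine start-up that a command-line wrapper incurs.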

[tesseract-ocr] pytesseract having high accuracy but performing very very slow

2021-03-25 Thread Vidya Chitragar
Hi everyone. I am using pytesseract with tesseract-ocr version 3.05.02 to convert a scanned PDF document of 1000k pages into a searchable PDF document, but my code takes more than 5 to 6 hours to produce the searchable PDF. Any suggestions would be very helpful. Thanks, Vidya