Re: [tesseract-ocr] Detecting language automatically

2021-03-25 Thread Charles Cho
Hi, 

>>> The OSD module does not detect language - it detects script, as you also
>>> noted earlier:
It detects the language by using OSD in Tesseract, and Tesseract also provides
the DetectOrientationScript function.

api.Init("/Users/renard/devel/textfairy/tessdata", "osd",
         tesseract::OcrEngineMode::OEM_DEFAULT);
api.SetPageSegMode(tesseract::PageSegMode::PSM_OSD_ONLY);
api.SetImage(pix);
api.DetectOrientationScript(&orient_deg, &orient_conf, &script_name,
                            &script_conf);

After this, script_name will hold the language name and script_conf will hold
the confidence value.
When I tested several languages, script_name got the following values:
English -> 'Latin'
French -> 'Latin'
German -> 'Latin'
Chinese_Sim -> 'Han'
Chinese_Tra -> 'Han'
Korean -> 'Korean'
Japanese -> 'Japanese'
Russian -> 'Cyrillic'

So the problem is that I want to distinguish the Latin-script languages
exactly, and I want to detect several languages at once from a single image.

Thanks.
Best,
Charles.
On Friday, March 26, 2021 at 2:33:26 AM UTC+8 Merlijn Wajer wrote:

> Hi, 
>
> On 25/03/2021 19:04, Charles Cho wrote: 
> > Hi. 
> > 
> > Thank you very much for your kind help, shree.
> > I tried to detect the script with your help and it worked. Great.
> >
> > I have some questions.
> > 1. If the image contains texts of different languages in a page, is there
> > any way to detect all of the languages? Now it detects only one language.
> > 2. It detects English, German, French as 'Latin'. So how can I distinguish
> > the languages exactly?
>
> The OSD module does not detect language - it detects script, as you also
> noted earlier:
>
> >>> So in my analysis, it uses Tesseract's OSD to detect the layout and
> >>> script.
> >>> After detecting the script, it detects the languages in that script.
>
> What's missing is performing OCR using just the script - and then 
> analysing the corpus to detect the language. 
>
> You could use something like this: https://github.com/saffsd/langid.c 
>
> Regards, 
> Merlijn 
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/7deebf13-4422-458d-a81f-a081e740d549n%40googlegroups.com.


Re: [tesseract-ocr] Detecting language automatically

2021-03-25 Thread Merlijn B.W. Wajer
Hi,

On 25/03/2021 19:04, Charles Cho wrote:
> Hi.
> 
> Thank you very much for your kind help, shree.
> I tried to detect the script with your help and it worked. Great.
> 
> I have some questions.
> 1. If the image contains texts of different languages in a page, is there 
> any way to detect all of the languages? Now it detects only one language.
> 2. It detects English, German, French as 'Latin'. So how can I distinguish 
> the languages exactly?

The OSD module does not detect language - it detects script, as you also
noted earlier:

>>> So in my analysis, it uses Tesseract's OSD to detect the layout and
>>> script.
>>> After detecting the script, it detects the languages in that script.

What's missing is performing OCR using just the script - and then
analysing the corpus to detect the language.

You could use something like this: https://github.com/saffsd/langid.c

Regards,
Merlijn
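
The last step described above, analysing the OCR output to detect the language, can be illustrated with a toy sketch. It is not langid.c and not the Internet Archive code: it assumes the page has already been OCRed with a Latin script model, and it uses small hand-picked stopword lists (an assumption made purely for illustration) to guess the language of each line. Guessing per line also shows how several languages on one page could be reported. The names STOPWORDS, guess_language and languages_on_page are hypothetical.

```python
import re
from collections import Counter

# Tiny, hand-picked stopword lists (illustrative only, not a trained model).
STOPWORDS = {
    "English": {"the", "and", "of", "to", "is", "in", "that", "it"},
    "French": {"le", "la", "les", "et", "de", "est", "un", "une", "que"},
    "German": {"der", "die", "das", "und", "ist", "nicht", "ein", "zu"},
}


def guess_language(text):
    """Return the language whose stopwords occur most often, or None."""
    # Keep runs of letters only (digits and underscores are dropped).
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(words)
    scores = {
        lang: sum(counts[word] for word in stops)
        for lang, stops in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def languages_on_page(ocr_text):
    """Guess a language per line and collect the distinct results in order."""
    found = []
    for line in ocr_text.splitlines():
        lang = guess_language(line)
        if lang is not None and lang not in found:
            found.append(lang)
    return found
```

A real application would replace guess_language with a proper language identification library, since stopword counting is unreliable on short or noisy OCR output.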



Re: [tesseract-ocr] Detecting language automatically

2021-03-25 Thread Charles Cho
Hi.

Thank you very much for your kind help, shree.
I tried to detect the script with your help and it worked. Great.

I have some questions.
1. If the image contains texts of different languages in a page, is there 
any way to detect all of the languages? Now it detects only one language.
2. It detects English, German, French as 'Latin'. So how can I distinguish 
the languages exactly?

Thanks.
Best,
Charles.

On Thursday, March 25, 2021 at 9:49:10 PM UTC+8 shree wrote:

> See 
> https://github.com/tesseract-ocr/tessdoc/blob/master/examples/OSD_example.cc
>
> // Get OSD - new code
> int orient_deg;
> float orient_conf;
> const char* script_name;
> float script_conf;
> api->DetectOrientationScript(&orient_deg, &orient_conf, &script_name,
>                              &script_conf);
> printf("\n Orientation in degrees: %d\n Orientation confidence: %.2f\n"
>        " Script: %s\n Script confidence: %.2f\n",
>        orient_deg, orient_conf,
>        script_name, script_conf);
>
> On Thursday, March 25, 2021 at 2:11:42 PM UTC+5:30 charles...@gmail.com 
> wrote:
>
>> Hi,
>>
>> I have been investigating how to detect the language automatically.
>> I referred to these links. Thank you, Merlijn.
>> https://archive.org/services/docs/api/ocr.html#autonomous-mode
>> https://git.archive.org/www/tesseract/-/blob/master/main.py#L757
>>
>> So in my analysis, it uses Tesseract's OSD to detect the layout and
>> script.
>> After detecting the script, it detects the languages in that script.
>>
>> So I tried to use the OSD engine mode in textfairy, which is an Android OCR
>> app based on tesseract 4.1.1.
>> But it doesn't work, and I am not sure how to use the OSD engine mode
>> in Android.
>> I set 'osd' as the language option string, used osd.traineddata, and set
>> 'OEM_OSD_ONLY' as the engine mode.
>> But it doesn't work.
>>
>> I hope someone can help me use the OSD engine mode in Android.
>>
>> Thank you.
>> Best,
>> Charles.
>>
>> On Monday, March 22, 2021 at 10:28:38 AM UTC+8 Charles Cho wrote:
>>
>>> Hi, Merlijn.
>>>
>>> Thanks for your kind response.
>>>
>>> Regarding autonomous mode, I'm trying to find such module for Android.
>>> But I found nothing. I will try more.
>>>
>>> >I am not sure what you're finding on google play store, but I have found
>>> >there to be no limitation to the amount of languages that can be used
>>> >during OCR. Keep in mind that using more languages will slow down the
>>> >OCR process.
>>> It's textfairy, open source app.
>>> https://play.google.com/store/apps/details?id=com.renard.ocr
>>>
>>> Your response is really helpful.
>>>
>>> Best,
>>> Charles.
>>> On Sunday, March 21, 2021 at 8:29:13 AM UTC+8 Merlijn Wajer wrote:
>>>
>>>> Hi,
>>>>
>>>> On 19/03/2021 10:11, Charles Cho wrote:
>>>> > Hello,
>>>> > I'm working on an OCR Android app based on tesseract.
>>>> > I want to add a feature that detects the language automatically and
>>>> > recognizes at least 2 languages at once.
>>>> > I have investigated that for a while, so I know that I have to specify
>>>> > the language for tesseract.
>>>> > So how can I implement automatic language detection?
>>>>
>>>> Not exactly a mobile use case, but you can read how the Internet Archive
>>>> does this (I coined it "autonomous mode", where the software just
>>>> figures out the scripts and languages):
>>>>
>>>> https://archive.org/services/docs/api/ocr.html#autonomous-mode
>>>>
>>>> And the code is available, here (I plan to split out the archive.org
>>>> specific code from the python code that invokes Tesseract and performs
>>>> heuristics like script detection):
>>>>
>>>> https://git.archive.org/www/tesseract/-/blob/master/main.py#L757
>>>>
>>>> the tl;dr is to first perform script detection, and use the detected
>>>> script to OCR the page - then use language detection libraries to guess
>>>> the languages on the page.
>>>>
>>>> > And tesseract on google play store can recognize 3 languages at once.
>>>> > Is it maximum?
>>>>
>>>> I am not sure what you're finding on google play store, but I have found
>>>> there to be no limitation to the amount of languages that can be used
>>>> during OCR. Keep in mind that using more languages will slow down the
>>>> OCR process.
>>>>
>>>> > Any help and advice would be really appreciated.
>>>>
>>>> Hope this helps.
>>>>
>>>> Cheers,
>>>> Merlijn
>>>>
>>>



Re: [tesseract-ocr] Detecting language automatically

2021-03-25 Thread shree
See https://github.com/tesseract-ocr/tessdoc/blob/master/examples/OSD_example.cc

// Get OSD - new code
int orient_deg;
float orient_conf;
const char* script_name;
float script_conf;
api->DetectOrientationScript(&orient_deg, &orient_conf, &script_name,
                             &script_conf);
printf("\n Orientation in degrees: %d\n Orientation confidence: %.2f\n"
       " Script: %s\n Script confidence: %.2f\n",
       orient_deg, orient_conf,
       script_name, script_conf);
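
A possible next step after the OSD call, sketched in Python under stated assumptions: the report text has the same layout as the printf output above, and each script name maps to a script-level traineddata model (for example script/Latin) that is assumed to be installed under the tessdata directory. The mapping and the helper names parse_osd_report and model_for_script are hypothetical, and 'Han' is mapped to the Simplified Chinese model only as a placeholder.

```python
# Hypothetical mapping from OSD script names (as reported in this thread)
# to script-level models; adjust it to the traineddata files actually shipped.
SCRIPT_TO_MODEL = {
    "Latin": "script/Latin",
    "Cyrillic": "script/Cyrillic",
    "Han": "script/HanS",  # placeholder: OSD alone cannot tell HanS from HanT
    "Japanese": "script/Japanese",
    "Korean": "script/Hangul",
}


def parse_osd_report(report):
    """Turn the 'Key: value' lines of an OSD report into typed fields."""
    fields = {}
    for line in report.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip blank lines and lines without a colon
            fields[key.strip()] = value.strip()
    return {
        "orient_deg": int(fields["Orientation in degrees"]),
        "orient_conf": float(fields["Orientation confidence"]),
        "script_name": fields["Script"],
        "script_conf": float(fields["Script confidence"]),
    }


def model_for_script(script_name):
    """Return the model name to OCR with, or None for an unmapped script."""
    return SCRIPT_TO_MODEL.get(script_name)
```

Returning None for an unknown script leaves the fallback decision (for example a default language) to the caller.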

On Thursday, March 25, 2021 at 2:11:42 PM UTC+5:30 charles...@gmail.com 
wrote:

> Hi,
>
> I have been investigating how to detect the language automatically.
> I referred to these links. Thank you, Merlijn.
> https://archive.org/services/docs/api/ocr.html#autonomous-mode
> https://git.archive.org/www/tesseract/-/blob/master/main.py#L757
>
> So in my analysis, it uses Tesseract's OSD to detect the layout and
> script.
> After detecting the script, it detects the languages in that script.
>
> So I tried to use the OSD engine mode in textfairy, which is an Android OCR
> app based on tesseract 4.1.1.
> But it doesn't work, and I am not sure how to use the OSD engine mode in
> Android.
> I set 'osd' as the language option string, used osd.traineddata, and set
> 'OEM_OSD_ONLY' as the engine mode.
> But it doesn't work.
>
> I hope someone can help me use the OSD engine mode in Android.
>
> Thank you.
> Best,
> Charles.
>
> On Monday, March 22, 2021 at 10:28:38 AM UTC+8 Charles Cho wrote:
>
>> Hi, Merlijn.
>>
>> Thanks for your kind response.
>>
>> Regarding autonomous mode, I'm trying to find such module for Android.
>> But I found nothing. I will try more.
>>
>> >I am not sure what you're finding on google play store, but I have found
>> >there to be no limitation to the amount of languages that can be used
>> >during OCR. Keep in mind that using more languages will slow down the
>> >OCR process.
>> It's textfairy, open source app.
>> https://play.google.com/store/apps/details?id=com.renard.ocr
>>
>> Your response is really helpful.
>>
>> Best,
>> Charles.
>> On Sunday, March 21, 2021 at 8:29:13 AM UTC+8 Merlijn Wajer wrote:
>>
>>> Hi, 
>>>
>>> On 19/03/2021 10:11, Charles Cho wrote: 
>>> > Hello, 
>>> > I'm working on an OCR Android app based on tesseract.
>>> > I want to add a feature that detects the language automatically and
>>> > recognizes at least 2 languages at once.
>>> > I have investigated that for a while, so I know that I have to specify
>>> > the language for tesseract.
>>> > So how can I implement automatic language detection?
>>>
>>> Not exactly a mobile use case, but you can read how the Internet Archive 
>>> does this (I coined it "autonomous mode", where the software just 
>>> figures out the scripts and languages): 
>>>
>>> https://archive.org/services/docs/api/ocr.html#autonomous-mode 
>>>
>>> And the code is available, here (I plan to split out the archive.org 
>>> specific code from the python code that invokes Tesseract and performs 
>>> heuristics like script detection): 
>>>
>>> https://git.archive.org/www/tesseract/-/blob/master/main.py#L757 
>>>
>>> the tl;dr is to first perform script detection, and use the detected 
>>> script to OCR the page - then use language detection libraries to guess 
>>> the languages on the page. 
>>>
>>> > And tesseract on google play store can recognize 3 languages at once. 
>>> > Is it maximum? 
>>>
>>> I am not sure what you're finding on google play store, but I have found 
>>> there to be no limitation to the amount of languages that can be used 
>>> during OCR. Keep in mind that using more languages will slow down the 
>>> OCR process. 
>>>
>>> > Any help and advice would be really appreciated. 
>>>
>>> Hope this helps. 
>>>
>>> Cheers, 
>>> Merlijn 
>>>
>>



Re: [tesseract-ocr] pytesseract having high accuracy but performing very very slow

2021-03-25 Thread Zdenko Podobny
1,000,000 pages in one PDF? Seriously?
Also, please post your code. pytesseract is not an efficient tool for multiple
images (it incurs disk I/O for each run/page).

Zdenko
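
One way to avoid starting a separate run for every page, not proposed in the thread itself, is to hand the tesseract command-line tool a text file that lists the page images and ask for the pdf output, so that a single process handles all pages. The sketch below only builds the list contents and the command. The function name and the default file names are hypothetical, and it assumes the pages were already extracted to image files and that the installed tesseract version accepts a file list as input.

```python
def batch_ocr_plan(image_paths, list_file="pages.txt", output_base="out"):
    """Return (list_text, command) for one multi-page tesseract run.

    list_text is meant to be written to list_file, one image path per line;
    command is an argument list suitable for subprocess.run.
    """
    list_text = "".join(f"{path}\n" for path in image_paths)
    # The trailing "pdf" selects the searchable-PDF output config.
    command = ["tesseract", list_file, output_base, "pdf"]
    return list_text, command
```

Writing list_text to pages.txt and passing command to subprocess.run(command, check=True) would then produce out.pdf, provided the tesseract binary is on the PATH.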


On Thu, 25 Mar 2021 at 8:49, Vidya Chitragar <
vidya.chitra...@lucidatechnologies.com> wrote:

> Hi everyone.
> I am using pytesseract with tesseract-ocr version 3.05.02 to convert a
> scanned PDF document of 1000k pages into a searchable PDF document, but my
> code takes more than 5 to 6 hours to produce the searchable PDF. Any
> suggestions would be very helpful.
> Thanks,
> Vidya
>



Re: [tesseract-ocr] Detecting language automatically

2021-03-25 Thread Charles Cho
Hi,

I have been investigating how to detect the language automatically.
I referred to these links. Thank you, Merlijn.
https://archive.org/services/docs/api/ocr.html#autonomous-mode
https://git.archive.org/www/tesseract/-/blob/master/main.py#L757

So in my analysis, it uses Tesseract's OSD to detect the layout and
script.
After detecting the script, it detects the languages in that script.
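
The flow described here (detect the script, OCR with that script, then detect the language from the text) can be written down as a small orchestration sketch. It is not the Internet Archive implementation: the three steps are passed in as callables, and the names autonomous_ocr, detect_script, ocr_with_script and guess_language are all hypothetical placeholders for whatever OSD, OCR and language identification code is actually used.

```python
def autonomous_ocr(image, detect_script, ocr_with_script, guess_language):
    """Run script detection, script-based OCR, then language guessing."""
    script = detect_script(image)          # e.g. OSD, yielding 'Latin'
    text = ocr_with_script(image, script)  # OCR using only the script model
    language = guess_language(text)        # language ID on the OCR text
    return {"script": script, "text": text, "language": language}
```

Passing the steps in as parameters keeps the sketch independent of any particular binding, so the same flow could wrap a JNI call on Android or a Python wrapper on a server.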

So I tried to use the OSD engine mode in textfairy, which is an Android OCR
app based on tesseract 4.1.1.
But it doesn't work, and I am not sure how to use the OSD engine mode in
Android.
I set 'osd' as the language option string, used osd.traineddata, and set
'OEM_OSD_ONLY' as the engine mode.
But it doesn't work.

I hope someone can help me use the OSD engine mode in Android.

Thank you.
Best,
Charles.

On Monday, March 22, 2021 at 10:28:38 AM UTC+8 Charles Cho wrote:

> Hi, Merlijn.
>
> Thanks for your kind response.
>
> Regarding autonomous mode, I'm trying to find such module for Android.
> But I found nothing. I will try more.
>
> >I am not sure what you're finding on google play store, but I have found
> >there to be no limitation to the amount of languages that can be used
> >during OCR. Keep in mind that using more languages will slow down the
> >OCR process.
> It's textfairy, open source app.
> https://play.google.com/store/apps/details?id=com.renard.ocr
>
> Your response is really helpful.
>
> Best,
> Charles.
> On Sunday, March 21, 2021 at 8:29:13 AM UTC+8 Merlijn Wajer wrote:
>
>> Hi, 
>>
>> On 19/03/2021 10:11, Charles Cho wrote: 
>> > Hello, 
>> > I'm working on an OCR Android app based on tesseract.
>> > I want to add a feature that detects the language automatically and
>> > recognizes at least 2 languages at once.
>> > I have investigated that for a while, so I know that I have to specify
>> > the language for tesseract.
>> > So how can I implement automatic language detection?
>>
>> Not exactly a mobile use case, but you can read how the Internet Archive 
>> does this (I coined it "autonomous mode", where the software just 
>> figures out the scripts and languages): 
>>
>> https://archive.org/services/docs/api/ocr.html#autonomous-mode 
>>
>> And the code is available, here (I plan to split out the archive.org 
>> specific code from the python code that invokes Tesseract and performs 
>> heuristics like script detection): 
>>
>> https://git.archive.org/www/tesseract/-/blob/master/main.py#L757 
>>
>> the tl;dr is to first perform script detection, and use the detected 
>> script to OCR the page - then use language detection libraries to guess 
>> the languages on the page. 
>>
>> > And tesseract on google play store can recognize 3 languages at once. 
>> > Is it maximum? 
>>
>> I am not sure what you're finding on google play store, but I have found 
>> there to be no limitation to the amount of languages that can be used 
>> during OCR. Keep in mind that using more languages will slow down the 
>> OCR process. 
>>
>> > Any help and advice would be really appreciated. 
>>
>> Hope this helps. 
>>
>> Cheers, 
>> Merlijn 
>>
>



Re: [tesseract-ocr] pytesseract having high accuracy but performing very very slow

2021-03-25 Thread Shree Devi Kumar
Try with a newer version of Tesseract.

On Thu, Mar 25, 2021, 13:19 Vidya Chitragar <
vidya.chitra...@lucidatechnologies.com> wrote:

> Hi everyone.
> I am using pytesseract with tesseract-ocr version 3.05.02 to convert a
> scanned PDF document of 1000k pages into a searchable PDF document, but my
> code takes more than 5 to 6 hours to produce the searchable PDF. Any
> suggestions would be very helpful.
> Thanks,
> Vidya
>



Re: [tesseract-ocr] Pytesseract processing images already in memory

2021-03-25 Thread Lorenzo Bolzani
Try tesserocr, a real binding library.


Bye

Lorenzo
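
A minimal sketch of what this could look like, under several assumptions: tesserocr and Pillow are installed, the pyvips version provides Image.numpy() (pyvips 2.2 or later), and the cropped image is an 8-bit image with 1, 3 or 4 bands. The function name ocr_in_memory is hypothetical.

```python
def ocr_in_memory(vips_image, lang="eng"):
    """OCR a pyvips image without an intermediate file on disk (sketch)."""
    # Third-party imports are kept local so this module still loads on
    # machines where the optional packages are not installed.
    from PIL import Image
    import tesserocr

    # pyvips -> NumPy array -> PIL image, all in memory.
    pil_image = Image.fromarray(vips_image.numpy())
    # tesserocr hands the pixels to the Tesseract library directly.
    return tesserocr.image_to_text(pil_image, lang=lang)
```

For many crops, creating one tesserocr.PyTessBaseAPI object and calling SetImage and GetUTF8Text on it repeatedly should avoid re-initialising the engine for every image.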

On Thu, 25 Mar 2021 at 05:44, Alex Zetaeffesse
wrote:

> Hi all,
>
> I'm already using a python library (pyvips) for cropping images with text
> inside.
> Is there a way to have Pytesseract process images in memory without the
> burden of writing them to disk and then loading them again with
>
> print(pytesseract.image_to_string(Image.open('test.png')))
>
> ?
>
> Thanks,
>
> Alex
>



[tesseract-ocr] pytesseract having high accuracy but performing very very slow

2021-03-25 Thread Vidya Chitragar
Hi everyone.
I am using pytesseract with tesseract-ocr version 3.05.02 to convert a
scanned PDF document of 1000k pages into a searchable PDF document, but my
code takes more than 5 to 6 hours to produce the searchable PDF. Any
suggestions would be very helpful.
Thanks,
Vidya
