Re: [tesseract-ocr] Re: How to improve ocr reader?

2020-03-28 Thread Teo
Ok thanks, I'll keep this.

On Saturday, 28 March 2020 at 19:24:12 UTC+1, Lorenzo Blz wrote:
>
> If you'd like to improve the OCR accuracy too, a simple contrast 
> enhancement (with a simple S-shaped curve) and a little sharpening help 
> with the left border. See the attached file.
>
> Lorenzo
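
For readers who want to try the S-shaped contrast curve suggested above, here is a minimal sketch in Python. It only builds and applies a lookup table for 8-bit grayscale values. The function names and the `gain` and `midpoint` parameters are assumptions chosen for illustration; they are not taken from Lorenzo's attachment.

```python
import math


def s_curve_lut(gain=8.0, midpoint=0.5):
    """Build a 256-entry lookup table for an S-shaped contrast curve.

    gain and midpoint are illustrative defaults, not values from the thread.
    """
    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-gain * (x - midpoint)))

    lo, hi = sigmoid(0.0), sigmoid(1.0)
    lut = []
    for v in range(256):
        # Rescale the sigmoid so that 0 maps to 0 and 255 maps to 255.
        y = (sigmoid(v / 255.0) - lo) / (hi - lo)
        lut.append(int(round(y * 255)))
    return lut


def apply_lut(pixels, lut):
    """Apply the table to a flat sequence of 8-bit grayscale values."""
    return [lut[p] for p in pixels]


if __name__ == "__main__":
    # Dark values get darker, bright values get brighter.
    print(apply_lut([0, 64, 128, 192, 255], s_curve_lut()))
```

With an imaging library such as Pillow, a table like this could be applied to a grayscale image through `Image.point(lut)`, followed by a mild sharpening filter. That step is left out here to keep the sketch free of dependencies.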

[tesseract-ocr] Re: How to improve ocr reader?

2020-03-28 Thread Teo
Ok thanks a lot.

On Saturday, 28 March 2020 at 19:04:25 UTC+1, Essam Zaky wrote:
>
> Yes, with the same command. The result is attached.

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/5116c498-15c8-4090-b125-1c30579c54f2%40googlegroups.com.


[tesseract-ocr] Re: How to improve ocr reader?

2020-03-28 Thread Teo
With the same command?
tesseract pho.png pho-eng -l eng pdf

On Saturday, 28 March 2020 at 18:48:17 UTC+1, Essam Zaky wrote:
>
> It works fine on my machine.
> It seems to be a problem with your PDF viewer.
> I used Adobe PDF Reader V9.0.
>
> Some PDF readers fail to read searchable PDFs; try checking with 
> another reader.
>
> Best Regards
> Essam



[tesseract-ocr] Re: How to improve ocr reader?

2020-03-28 Thread Essam Zaky
It works fine on my machine.
It seems to be a problem with your PDF viewer.
I used Adobe PDF Reader V9.0.

Some PDF readers fail to read searchable PDFs; try checking with 
another reader.

Best Regards
Essam

On Saturday, 28 March 2020 at 7:34:59 PM UTC+2, Teo wrote:
>
> Ok



[tesseract-ocr] Re: How to improve ocr reader?

2020-03-28 Thread Essam Zaky
Please attach the original image so I can check it on my machine.

On Saturday, 28 March 2020 at 7:24:07 PM UTC+2, Teo wrote:
>
> Thanks for the reply. 
> I just opened an issue on GitHub for Tesseract. Then I tried to create a 
> PDF with Tesseract alone, without gImageReader, using: 
> tesseract pho.png pho-eng -l eng pdf
> but this is the result...



[tesseract-ocr] Re: How to improve ocr reader?

2020-03-26 Thread Essam Zaky
So I guess the error is in the PDF generation module.
You have the following options:
- try to fix the bug yourself
- raise an issue in the Tesseract issue tracker, but first check that the 
  issue is not already in the list of issues
- use another external library to create the searchable PDF from the hOCR 
  output

Before Tesseract added the PDF generation feature, I used a library called 
iTextSharp to generate the PDF, and the result was very good for me.

On Thursday, 26 March 2020 at 10:54:50 PM UTC+2, Teo wrote:
>
> Ok coordinates seem correct.



[tesseract-ocr] Re: How to improve ocr reader?

2020-03-26 Thread Teo
Ok coordinates seem correct.

On Thursday, 26 March 2020 at 19:13:52 UTC+1, Essam Zaky wrote:
>
> Read this document:
> https://tesseract-ocr.github.io/tessdoc/Command-Line-Usage
>
> The following command returns the coordinates:
>
> tesseract testing/eurotext.png testing/eurotext-eng -l eng hocr
>
> The hOCR output contains each word as text together with its coordinates.
> You can open the image in any image editor, such as MS Paint, and check 
> that the returned coordinates match the words in the image.
>
> Best Regards


The main topics of theoretical computer science are taught in most computer
Science and engineering curricula, but are not presented as a foundation for
omputer studies. Most courses—and their reference textbooks—are highly
sed in their choice of topics. Very often they overemphasize traditional areas
such as formal languages and automata—and pay little or no attention to
yer important topics—such as formal semantics or computational complexity.
The organization of this book results from our strongly held belief that
oretical computer science should be viewed as the cornerstone of computer
ence and engineering curricula. Computer specialists, in their everyday life,
must be able to translate actual problems into abstractions based on the use of
ormal models, to manipulate such formal descriptions, and to reason about their
_ Properties in a rigorous way. This very special attitude differentiates the 
com-
puter specialist from most other technical professionals.
For these reasons, we suggest that an exposure to theoretical computer
science topics should be given in the early stage of computer science education,
particularly at the undergraduate level. Theoretical topics should not be 
viewed as
options that can be added late in the curricula. Rather, they must be viewed as
ee Wa Viieel oc

  
  


[tesseract-ocr] Re: How to improve ocr reader?

2020-03-26 Thread Essam Zaky
Read this document:
https://tesseract-ocr.github.io/tessdoc/Command-Line-Usage

The following command returns the coordinates:

tesseract testing/eurotext.png testing/eurotext-eng -l eng hocr

The hOCR output contains each word as text together with its coordinates.
You can open the image in any image editor, such as MS Paint, and check that 
the returned coordinates match the words in the image.
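
One possible way to pull the word coordinates out of the hOCR file for checking, sketched in Python with only the standard library. It assumes the usual Tesseract hOCR layout, where each word is a `span` with class `ocrx_word` whose `title` attribute contains `bbox x0 y0 x1 y1`. The class and function names are hypothetical, and the sample markup with its numbers is made up for illustration.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")

# Made-up hOCR fragment in the shape Tesseract normally writes.
SAMPLE = """
<div class='ocr_page' title='bbox 0 0 640 480'>
 <span class='ocr_line' title='bbox 36 92 580 122'>
  <span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>The</span>
  <span class='ocrx_word' title='bbox 109 92 185 122; x_wconf 93'>main</span>
 </span>
</div>
"""


class WordBoxParser(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "ocrx_word" in (attrs.get("class") or ""):
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(g) for g in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Limitation: spans nested inside a word span would end it early.
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def word_boxes(hocr_text):
    """Return the list of (word, bbox) pairs found in an hOCR string."""
    parser = WordBoxParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


if __name__ == "__main__":
    for word, box in word_boxes(SAMPLE):
        print(word, box)
```

For a real run, the file written by the command above (typically `testing/eurotext-eng.hocr`) would be read and passed to `word_boxes`. Each box is in pixels, with the origin at the top-left corner of the image, so it can be compared directly with what an image editor shows.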

Best Regards

On Thursday, 26 March 2020 at 1:10:22 PM UTC+2, Teo wrote:
>
> Thanks for your help. How can I get the coordinates, and how do I check 
> whether they are correct?



[tesseract-ocr] Re: How to improve ocr reader?

2020-03-26 Thread Teo
Thanks for your help. How can I get the coordinates, and how do I check 
whether they are correct?

On Wednesday, 25 March 2020 at 10:41:07 UTC+1, Essam Zaky wrote:
>
> You now need to check the coordinates returned by Tesseract. Use the hOCR 
> output and check whether the word coordinates are returned correctly. If 
> they are, it is a bug in PDF generation.
>
> If the coordinates are wrong, it's a bug in Tesseract.
>
> Previously I used a library called iTextSharp to generate searchable PDFs. 
> The library is a port of the iText Java library, and it gives good PDF 
> output.



[tesseract-ocr] Re: How to improve ocr reader?

2020-03-25 Thread Essam Zaky
You now need to check the coordinates returned by Tesseract. Use the hOCR 
output and check whether the word coordinates are returned correctly. If 
they are, it is a bug in PDF generation.

If the coordinates are wrong, it's a bug in Tesseract.

Previously I used a library called iTextSharp to generate searchable PDFs. 
The library is a port of the iText Java library, and it gives good PDF output.

On Wednesday, 25 March 2020 at 11:25:46 AM UTC+2, Teo wrote:
>
> Ok, I think that it's the PDF generation module, because the txt is almost 
> the same, with the exception of some "the" which Tesseract sees as "thè".

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/a4934f98-f1bc-4fcf-9bc1-c4805c143094%40googlegroups.com.


[tesseract-ocr] Re: How to improve ocr reader?

2020-03-25 Thread Teo
I discovered that the problem is not with the recognition but with the export to 
PDF: I saved both results as txt files and they are almost 
the same. So how can I make the export more like ABBYY's, with the text 
placed precisely on the document, all aligned?

On Wednesday, 25 March 2020 at 10:25:46 UTC+1, Teo wrote:
>
> OK, I think the problem is in the PDF generation module, because the text output is almost 
> the same, except for some instances of "the" that Tesseract reads as "thè".
>
> On Wednesday, 25 March 2020 at 07:25:11 UTC+1, Essam Zaky wrote:
>>
>> You need to find out which part to improve: the Tesseract engine or the PDF generation.
>>
>> So compare the text files from ABBYY and Tesseract.
>> If the results differ substantially, you need to improve the image quality or 
>> improve the LSTM model.
>>
>> If the Tesseract result is good, you need to improve the PDF 
>> generation module.
>>
>> On Wednesday, 25 March 2020 at 7:04:14 AM UTC+2, Teo wrote:
>>>
>>> The quality is already very good, but it is lower than ABBYY FineReader. The 
>>> attachment shows a comparison between ABBYY and gImageReader OCR, and you 
>>> can see the difference. How can we improve it?
>>>
>>>
>>>
>>>

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/06e4a583-3b9a-48e6-95ca-7591f77ad615%40googlegroups.com.


[tesseract-ocr] Re: How to improve ocr reader?

2020-03-25 Thread Teo
OK, I think the problem is in the PDF generation module, because the text output is almost 
the same, except for some instances of "the" that Tesseract reads as "thè".
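
To separate the "the" / "thè" kind of error from other differences, one possible approach is to look for word pairs that differ only in their diacritics. The sketch below assumes the two word lists are already aligned position by position, which will not hold if one engine drops or splits words. `strip_accents` and `accent_only_mismatches` are names made up for this example.

```python
import unicodedata


def strip_accents(word):
    """Remove combining marks, so that "thè" becomes "the"."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def accent_only_mismatches(reference_words, ocr_words):
    """Return (reference, ocr) pairs that differ only by diacritics.

    Assumes both lists are aligned word for word; zip() silently stops
    at the end of the shorter list.
    """
    return [
        (ref, ocr)
        for ref, ocr in zip(reference_words, ocr_words)
        if ref != ocr and strip_accents(ref) == strip_accents(ocr)
    ]


if __name__ == "__main__":
    reference = "the quality of the scan".split()
    ocr = "thè quality of the scan".split()
    print(accent_only_mismatches(reference, ocr))
```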

On Wednesday, 25 March 2020 at 07:25:11 UTC+1, Essam Zaky wrote:
>
> You need to find out which part to improve: the Tesseract engine or the PDF generation.
>
> So compare the text files from ABBYY and Tesseract.
> If the results differ substantially, you need to improve the image quality or 
> improve the LSTM model.
>
> If the Tesseract result is good, you need to improve the PDF 
> generation module.
>
> On Wednesday, 25 March 2020 at 7:04:14 AM UTC+2, Teo wrote:
>>
>> The quality is already very good, but it is lower than ABBYY FineReader. The 
>> attachment shows a comparison between ABBYY and gImageReader OCR, and you 
>> can see the difference. How can we improve it?
>>
>>
>>
>>

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/f0f76fd5-51fe-4b65-af63-04ba1bcebd97%40googlegroups.com.


[tesseract-ocr] Re: How to improve ocr reader?

2020-03-25 Thread Essam Zaky
You need to find out which part to improve: the Tesseract engine or the PDF generation.

So compare the text files from ABBYY and Tesseract.
If the results differ substantially, you need to improve the image quality or 
improve the LSTM model.

If the Tesseract result is good, you need to improve the PDF 
generation module.
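
The text comparison described above could be done by eye, or with a small script along these lines. This is a sketch using Python's standard `difflib`; the function names `text_similarity` and `differing_words` are invented for the example, and what counts as "highly different" is left to the reader since the thread gives no threshold.

```python
import difflib


def text_similarity(text_a, text_b):
    """Word-level similarity between two OCR outputs, from 0.0 to 1.0."""
    matcher = difflib.SequenceMatcher(
        None, text_a.split(), text_b.split(), autojunk=False
    )
    return matcher.ratio()


def differing_words(text_a, text_b):
    """Return (words_from_a, words_from_b) for every non-matching stretch."""
    words_a, words_b = text_a.split(), text_b.split()
    matcher = difflib.SequenceMatcher(None, words_a, words_b, autojunk=False)
    return [
        (words_a[i1:i2], words_b[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]


if __name__ == "__main__":
    # In practice these strings would be read from the two exported txt files.
    abbyy_text = "the quick brown fox"
    tesseract_text = "thè quick brown fox"
    print(text_similarity(abbyy_text, tesseract_text))
    print(differing_words(abbyy_text, tesseract_text))
```

A ratio close to 1.0 with only a handful of differing words would point towards the PDF generation rather than the recognition.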

On Wednesday, 25 March 2020 at 7:04:14 AM UTC+2, Teo wrote:
>
> The quality is already very good, but it is lower than ABBYY FineReader. The 
> attachment shows a comparison between ABBYY and gImageReader OCR, and you 
> can see the difference. How can we improve it?
>
>
>
>

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/4e5af415-9b61-4f02-bd55-1e2a865987b7%40googlegroups.com.