[tesseract-ocr] Re: Inconsistencies in detection and extraction of text using tesseract

Sundara Ganesh Mon, 17 Jun 2024 21:50:59 -0700

You said: * Now I am trying to save it in a CSV. For that, I am using the 
coordinates of the detected text and reconstructing the table structure. *


So, I assume you identified the columns based on the coordinates. If so, 
you know that the words before the text of the second column belong to the 
first column, so you should join them together, surround the result with 
quotes, and follow it with a single comma (not a comma after every word). 
BTW, when you open the resulting CSV file in a spreadsheet, you may have to 
widen the first column to see the long text in it.
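
A minimal sketch of that grouping, in case it helps. It assumes each detected 
word is available as a (text, left, top) tuple (for example, built from the 
text/left/top fields that pytesseract.image_to_data returns) and that you 
already know the x-coordinate where each column starts. The function names, 
the tuple layout, and the row_tolerance value are my own placeholders, not 
anything from your code.

```python
import csv
import io
from bisect import bisect_right


def words_to_rows(words, column_starts, row_tolerance=10):
    """Group OCR words into table cells.

    words: iterable of (text, left, top) tuples (assumed input shape).
    column_starts: sorted x-coordinates where each column begins.
    row_tolerance: max difference in `top` for words on the same row (a guess).
    """
    rows = []  # each entry: (top of the first word in the row, [(left, text), ...])
    for text, left, top in sorted(words, key=lambda w: (w[2], w[1])):
        if rows and abs(top - rows[-1][0]) <= row_tolerance:
            rows[-1][1].append((left, text))
        else:
            rows.append((top, [(left, text)]))

    table = []
    for _, items in rows:
        cells = [[] for _ in column_starts]
        for left, text in sorted(items):
            # The last column start at or before `left` owns this word.
            col = max(bisect_right(column_starts, left) - 1, 0)
            cells[col].append(text)
        # All words of one cell are joined into a single field.
        table.append([" ".join(cell) for cell in cells])
    return table


def table_to_csv(table):
    """Serialize the table; csv.writer adds quotes where a field needs them."""
    buffer = io.StringIO()
    csv.writer(buffer, quoting=csv.QUOTE_MINIMAL).writerows(table)
    return buffer.getvalue()
```

Letting the csv module do the quoting avoids hand-building the line, so a 
comma inside a cell does not split it into extra columns. To write a file 
instead of a string, pass a file opened with newline="" to csv.writer.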

I hope I didn't misread your question to give you this obvious answer.
On Monday, June 3, 2024 at 6:51:26 AM UTC-7 Saanvi Bhagat wrote:

> Thank you so much for your help!! Using interpolation improved my results 
> to a great extent. I would like one more suggestion from you. I have 
> extracted the text from the table in the image. Now I am trying to save it 
> in a CSV. For that, I am using the coordinates of the detected text and 
> reconstructing the table structure. 
> I am providing the input image and a screenshot of the resulting output 
> in the CSV file. As can be seen in the output_in_csv image, the facts 
> and figures are being saved correctly; however, the first column is 
> wrong. A new column is being generated for each word. That might be 
> because tesseract detects the text word by word and hence creates a new 
> column for each word. Could you please suggest a way to improve my 
> results (mainly the first column)?
> The main issues are repetition in the column values and a new column being 
> created for each word rather than just one column.  
> On Saturday, June 1, 2024 at 11:21:17 AM UTC+5:30 [email protected] 
> wrote:
>
>> Try resizing the image to increase its size, using interpolation with 
>> inter_area or inter_cubic. The bigger the image, the better tesseract 
>> performs. PSM 6 is the right setting.
>>
>> On Saturday 1 June 2024 at 00:19:32 UTC+12 [email protected] wrote:
>>
>>>
>>> In order to improve the results, I have applied Canny edge detection 
>>> and the Hough line transform to the images. Then I fed the binarized 
>>> image to the tesseract model.
>>>
>>> text = pytesseract.image_to_string(cropped_frame, lang='eng', config='--psm 6 --oem 3')
>>> The results have improved a bit, but are still far from perfect. The 
>>> negative signs are being omitted, and some of them are being misread as 
>>> ~. Similarly, some decimal points are also being omitted: 22.5 was 
>>> extracted as 225.
>>> On Friday, May 31, 2024 at 1:07:01 PM UTC+5:30 [email protected] wrote:
>>>
>>>> It's hard to give an opinion without seeing how you set up tesseract. 
>>>> What PSM did you specify, etc.?
>>>>
>>>> On Friday 31 May 2024 at 02:34:36 UTC+12 [email protected] wrote:
>>>>
>>>>> I have provided the image from which I am trying to extract text 
>>>>> using tesseract ocr (input.jpeg). Along with that, I have also provided 
>>>>> the extracted text from the image. As can be observed from the images, 
>>>>> the extracted text is not very accurate. Negative signs have been 
>>>>> omitted, and some undesired characters also appear in the extracted 
>>>>> text. (I have marked some of the incorrect results with blue boxes.)
>>>>>
>>>>> I have tried to improve the results by preprocessing and bringing 
>>>>> changes in the parameters of the model. I have tried:
>>>>>
>>>>> 1. Binarizing the images
>>>>>
>>>>> 2. HDR processing of the images
>>>>>
>>>>> Even then, such inconsistencies remain.
>>>>>
>>>>> How can I improve the detection and extraction of text in tesseract? I 
>>>>> have also tried paddleocr for the same task. Even then, symbols such as 
>>>>> the euro sign and some negative signs are not being detected.
>>>>>
>>>>

