OCR of source code with tesseract is a problem:

- tesseract is not focused on keeping spaces/indentation
- you have to reconstruct it yourself (e.g. by parsing the hOCR output)
- tesseract is focused more on "real" text, while source code is more symbolic, with a lot of extra characters, case sensitivity, etc.

So I am quite sure you will need to correct the tesseract output manually.
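To illustrate the "reconstruct it yourself" point, here is a minimal sketch of rebuilding indentation from hOCR output. It is not a tesseract feature. It assumes the code was rendered in a monospaced font, that the hOCR marks lines as `ocr_line` spans and words as `ocrx_word` spans with a `bbox x0 y0 x1 y1` entry in the `title` attribute, and that the leftmost line has zero indentation. The names `HocrLines` and `reindent` are made up for this example.

```python
import re
from html.parser import HTMLParser
from statistics import median

# hOCR stores coordinates in the title attribute: "bbox x0 y0 x1 y1; ..."
BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrLines(HTMLParser):
    """Collects [x0, x1, words] for every ocr_line span in an hOCR document."""

    def __init__(self):
        super().__init__()
        self.lines = []    # one [x0, x1, [word, ...]] entry per text line
        self._stack = []   # class attribute of every currently open <span>
        self._word = []    # text fragments of the word being read

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attributes = dict(attrs)
        cls = attributes.get("class") or ""
        self._stack.append(cls)
        if cls == "ocr_line":
            match = BBOX.search(attributes.get("title") or "")
            x0, x1 = (int(match.group(1)), int(match.group(3))) if match else (0, 0)
            self.lines.append([x0, x1, []])
        elif cls == "ocrx_word":
            self._word = []

    def handle_endtag(self, tag):
        if tag != "span" or not self._stack:
            return
        cls = self._stack.pop()
        if cls == "ocrx_word" and self.lines:
            text = "".join(self._word).strip()
            if text:
                self.lines[-1][2].append(text)

    def handle_data(self, data):
        # Only keep text that sits inside a word span.
        if "ocrx_word" in self._stack:
            self._word.append(data)


def reindent(hocr: str) -> str:
    """Return the recognized text with leading spaces estimated from bboxes."""
    parser = HocrLines()
    parser.feed(hocr)
    lines = [(x0, x1, " ".join(words)) for x0, x1, words in parser.lines if words]
    if not lines:
        return ""
    # Assumption: monospaced font, so line width / character count ~ glyph width.
    widths = [(x1 - x0) / len(text) for x0, x1, text in lines if x1 > x0]
    if not widths:
        return "\n".join(text for _, _, text in lines)
    char_width = median(widths)
    left_margin = min(x0 for x0, _, _ in lines)
    return "\n".join(
        " " * round((x0 - left_margin) / char_width) + text
        for x0, _, text in lines
    )
```

The hOCR file itself can be produced with something like `tesseract image.png out --psm 4 hocr`, then passed in as `reindent(open("out.hocr", encoding="utf-8").read())`. Blank lines and multiple spaces inside a line are not recovered by this sketch, and misrecognized characters still need manual correction.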
Zdenko

On Mon, 22 Nov 2021 at 6:54, J S <[email protected]> wrote:

> Hi all,
> I am trying to OCR some code written in Python. I've read the Tesseract doc
> many times and applied three preprocessing scripts with ImageMagick. The
> resulting image is attached.
> I then send it to Tesseract with ```--psm 4```, which seems to be the most
> suitable segmentation mode for what I am trying to do. The result is quite
> OK, but I don't have indentation and I think it could still be improved.
>
> I would be glad to have some advice to improve the result. Thanks a lot
>
> Best,
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/c07b4f66-7e6e-4634-a4ee-b8a8db003f20n%40googlegroups.com.

