Hello everyone,
I'm using Tesseract in VB.NET together with
hOcr2Pdf.NET <https://hocrtopdf.codeplex.com/>
to write the OCR data as an underlay text layer and build a searchable PDF.
Tesseract recognizes the text well. My problem is that the underlay
text ends up in the wrong position, as you can see in the attached image.
Has anyone already run into this problem?
I'm passing the hOCR HTML returned by Tesseract's GetHOCRText to the
hDocument of hOcr2Pdf.NET, but the positions and sizes seem to be wrong.
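For context, here is a minimal sketch of the coordinate conversion that an
hOCR-to-PDF step has to perform somewhere. It is not hOcr2Pdf.NET's actual
code. It assumes that hOCR bbox values are image pixels measured from the
top-left corner, that PDF coordinates are points (1/72 inch) measured from
the bottom-left corner, and that the image resolution in DPI is known. The
names HocrBoxSketch, PdfBox, ParseBBox and ToPdfBox are hypothetical and
exist only for this illustration.

```vbnet
Imports System
Imports System.Globalization
Imports System.Text.RegularExpressions

Module HocrBoxSketch
    ' Rectangle in PDF points, origin at the bottom-left corner of the page.
    Public Structure PdfBox
        Public Left As Double
        Public Bottom As Double
        Public Width As Double
        Public Height As Double
    End Structure

    ' Reads "bbox x0 y0 x1 y1" from an hOCR title attribute.
    ' The values are image pixels with the origin at the top-left corner.
    Public Function ParseBBox(title As String) As Integer()
        Dim m As Match = Regex.Match(title, "bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")
        If Not m.Success Then Throw New FormatException("No bbox in: " & title)
        Dim box(3) As Integer
        For i As Integer = 0 To 3
            box(i) = Integer.Parse(m.Groups(i + 1).Value, CultureInfo.InvariantCulture)
        Next
        Return box
    End Function

    ' Converts a pixel box to PDF points. Assumption: dpi is the true
    ' resolution of the page image; a wrong value scales every box.
    Public Function ToPdfBox(box As Integer(), imageHeightPx As Integer, dpi As Double) As PdfBox
        Dim scale As Double = 72.0 / dpi
        Dim result As PdfBox
        result.Left = box(0) * scale
        ' Flip the vertical axis: hOCR counts down from the top, PDF up from the bottom.
        result.Bottom = (imageHeightPx - box(3)) * scale
        result.Width = (box(2) - box(0)) * scale
        result.Height = (box(3) - box(1)) * scale
        Return result
    End Function

    Sub Main()
        ' Hypothetical word box on a 3300 px tall page scanned at 300 DPI.
        Dim box As Integer() = ParseBBox("bbox 300 600 900 660; x_wconf 93")
        Dim pdf As PdfBox = ToPdfBox(box, 3300, 300.0)
        Console.WriteLine("{0} {1} {2} {3}", pdf.Left, pdf.Bottom, pdf.Width, pdf.Height)
    End Sub
End Module
```

If the resolution used in this step differs from the real resolution of the
image, every box is scaled by the same factor, which would show up as text
that is shifted and sized wrongly across the whole page.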
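My code to create the PDF:

    ' Using disposes the page even if an exception is thrown.
    Using page = tesseract.Process(currentPageImage)
        ' Parse the hOCR text; the result is appended to hdoc.Pages.
        OCRParser.ParseHOCR(hdoc, page.GetHOCRText(0, True), True)
        ' Add the newly parsed (last) page together with its image.
        pdfCreator.AddPage(hdoc.Pages(hdoc.Pages.Count - 1), currentPageImage)
        ' Drop the page from hdoc once it has been written.
        hdoc.Pages.RemoveAt(hdoc.Pages.Count - 1)
    End Using
    pdfCreator.SaveAndClose()
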
The OCRParser class is a copy of the Parser class of hOcr2Pdf.NET; the
original is in a private namespace, so I can't access it directly.
I did this because adding a new HTML page to hDocument requires the path
of an HTML file, and I don't want to save the Tesseract output to disk
just to pass it as an argument.
So I changed the Parser class to build the HTML object from text instead
of from a file, and now I can pass the HTML text rather than the path of
an HTML file.
Could my problem be related to Tesseract training? Is it recognizing the
wrong font size or something like that?
I'm using the default English trained data. If I created my own trained
data from my samples, would the underlay text then be placed with the
right size and position?
Many thanks!
Edson Luis Moretti.