Hello Tom,

I'm sorry, I forgot to put the version in my post!
I'm using version 3.0.2.0.
This is the latest version that has a wrapper for .NET.
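
One thing that may be worth checking for the positioning problem: hOCR bounding boxes are given in image pixels with the origin at the top-left, while PDF pages are measured in points (72 per inch) with the origin at the bottom-left. If the image resolution assumed during conversion does not match the real DPI of the scan, the underlay text would end up shifted and scaled. I have not confirmed that this is the cause, or how hOcr2Pdf.NET does the conversion internally. The following is only a minimal sketch of the arithmetic; the function name ToPdfRect, the 300 DPI value, and the sample box are hypothetical and not part of either library.

```vbnet
Imports System

Module HocrBoxToPdf
    ' PDF default user space: 72 points per inch.
    Private Const PointsPerInch As Double = 72.0

    ' Hypothetical helper (not part of Tesseract or hOcr2Pdf.NET).
    ' Converts an hOCR bbox (pixels, origin top-left) into a PDF rectangle
    ' (points, origin bottom-left): {left, bottom, width, height}.
    ' dpi must be the real resolution of the image that was OCRed.
    Function ToPdfRect(x0 As Integer, y0 As Integer,
                       x1 As Integer, y1 As Integer,
                       imageHeightPx As Integer, dpi As Double) As Double()
        Dim leftPt As Double = x0 * PointsPerInch / dpi
        ' Flip the vertical axis: hOCR y grows downwards, PDF y grows upwards.
        Dim bottomPt As Double = (imageHeightPx - y1) * PointsPerInch / dpi
        Dim widthPt As Double = (x1 - x0) * PointsPerInch / dpi
        Dim heightPt As Double = (y1 - y0) * PointsPerInch / dpi
        Return New Double() {leftPt, bottomPt, widthPt, heightPt}
    End Function

    Sub Main()
        ' Assumed sample: a word box on a 3300 px tall page scanned at 300 DPI.
        Dim r As Double() = ToPdfRect(300, 600, 900, 660, 3300, 300.0)
        Console.WriteLine("{0} {1} {2} {3}", r(0), r(1), r(2), r(3))
    End Sub
End Module
```

If the same box is converted with the wrong DPI (for example 96 instead of 300), every coordinate is scaled by the same wrong factor, which would look like text that is both misplaced and the wrong size.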

On Thursday, March 10, 2016 at 13:03:17 UTC, Edson Luis Moretti 
wrote:
>
> Hello everyone,
>
> I'm using Tesseract in VB.NET with 
> hOcr2Pdf.NET <https://hocrtopdf.codeplex.com/> 
> to write an underlay text layer from the OCR data and build a searchable PDF.
>
> Tesseract is recognizing the text well. My problem is that the underlay 
> text is in the wrong position, as you can see in the attached image.
>
> Has anyone already had this problem? 
>
> I'm passing the HTML generated by Tesseract.GetHOCRText to the 
> hDocument of hOcr2Pdf.NET, but it seems the positions and sizes are wrong.
>
> My code to create the PDF:
>             With tesseract.Process(currentPageImage)
>                 OCRParser.ParseHOCR(hdoc, .GetHOCRText(0, True), True)
>                 pdfCreator.AddPage(hdoc.Pages(hdoc.Pages.Count - 1), currentPageImage)
>                 hdoc.Pages.RemoveAt(hdoc.Pages.Count - 1)
>                 .Dispose()
>             End With
>             pdfCreator.SaveAndClose()
>
> The OCRParser class is a copy of the Parser class from hOcr2Pdf.NET; the 
> original is in a private namespace that I can't access. 
> I copied it because adding a new HTML page to the hDocument requires the 
> path of an HTML file, and I don't want to save the Tesseract output to disk 
> just to pass it as an argument.
> My copy of the Parser class builds the HTML object from text instead of 
> from a file, so I can pass the HTML text rather than the path of an 
> HTML file.
>
> Could my problem be related to Tesseract training? Is it 
> recognizing the wrong font size, or something like that?
>
> I'm using the default English trained data. If I made my own trained data 
> from my samples, would the underlay text be created in the right 
> size and position?
>
> Many thanks!
> Edson Luis Moretti.
>
>
>
>
