I think the hOCR output also has an option to emit bounding-box info per character.
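A rough sketch of how that could look, assuming Tesseract 4.1 or later, where I believe the relevant config variable is hocr_char_boxes. Tesseract does not read PDF directly, so the page would first have to be rendered to an image (page.png below is a placeholder name). As far as I know, each character then appears in the hOCR as a span of class ocrx_cinfo whose title carries "x_bboxes x0 y0 x1 y1"; please check that against your own output. The SAMPLE string is made up for illustration and the helper names are my own, not part of Tesseract.

```python
import re
from html.parser import HTMLParser

# Assumed command line (Tesseract 4.1+, Greek model "ell" installed):
#   tesseract page.png page -l ell -c hocr_char_boxes=1 hocr
# This should write page.hocr, which can be passed to char_boxes() below.


class CharBoxParser(HTMLParser):
    """Collect (char, x, y, width, height) from ocrx_cinfo spans."""

    def __init__(self):
        super().__init__()
        self.boxes = []
        self._stack = []  # one entry per open <span>: a bbox tuple or None

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        a = dict(attrs)
        bbox = None
        if "ocrx_cinfo" in (a.get("class") or "").split():
            m = re.search(r"x_bboxes\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)",
                          a.get("title") or "")
            if m:
                bbox = tuple(int(v) for v in m.groups())
        self._stack.append(bbox)

    def handle_endtag(self, tag):
        if tag == "span" and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Only text directly inside a character span is recorded.
        if self._stack and self._stack[-1] is not None and data.strip():
            x0, y0, x1, y1 = self._stack[-1]
            self.boxes.append((data.strip(), x0, y0, x1 - x0, y1 - y0))


def char_boxes(hocr_text):
    """Return a list of (char, x, y, width, height) tuples."""
    parser = CharBoxParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.boxes


# Made-up fragment in the shape I would expect from hocr_char_boxes=1.
SAMPLE = (
    "<span class='ocrx_word' id='word_1_1' title='bbox 10 20 34 40; x_wconf 93'>"
    "<span class='ocrx_cinfo' title='x_bboxes 10 20 22 40; x_conf 98.5'>κ</span>"
    "<span class='ocrx_cinfo' title='x_bboxes 23 21 34 40; x_conf 97.1'>α</span>"
    "</span>"
)

if __name__ == "__main__":
    for ch, x, y, w, h in char_boxes(SAMPLE):
        print(ch, x, y, w, h)
```

For a real file, something like char_boxes(open("page.hocr", encoding="utf-8").read()) should give one tuple per recognized character, with width and height derived from the corner coordinates.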

On Fri, 31 May 2019, 19:07 G. S., <[email protected]> wrote:

> Dear all,
>
> i have a pdf image file, (in Greek language)
>
> i would appreciate if you could help me on how i could
>
> a) have an output similar to what pdf alto does,
>
> but more important, have the position width and height info in a per
> character base.
>
> Up to now, pdfalto considers each word to be a token, so the output is on
> a per word base.
>
> https://github.com/kermitt2/pdfalto/issues/34
>
> Please tell me how would you approach this with
>
> https://github.com/tesseract-ocr
>
> which command and which parameters you would use?
>
> thank you very much in advance
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/32091990-88b9-426d-94f0-2c5278a9b9da%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

