Hello Tesseract-OCR community,

I am very happy with the quality of the conversions. However, when OCR'ing numbered bullet points, the formatting of the output text file is wrong: the file lists the bullet point numbers first and the corresponding text only afterwards. This is not really an OCR issue, as all the text is recognized correctly, but I would like to know whether I can fix the formatting as well.
I cannot show snippets of documents where it went wrong, but I can show an example:

*Input file:*

Lorem ipsum dolor sit am.
1. Ex vero phaedrum ius.
2. appareat patrioque mea. Has at alienum scaevola indoctum
3. No his modo quaerendum,
4. consul eruditi ex vim.

*Output file:*

Lorem ipsum dolor sit am.
1. Ex vero phaedrum ius.
2.
3.
4.
appareat patrioque mea. Has at alienum scaevola indoctum
No his modo quaerendum,
consul eruditi ex vim.

*Bonus question:*
*Assuming Google Docs use Tesseract-OCR, what is their setup / languages?*
*Their formatting of output PDFs is gorgeous.*
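One conceivable workaround for the list problem is to repair the text file after the OCR run. The Python sketch below is only an illustration of that idea, not something Tesseract provides: the function name `rejoin_list_markers` is hypothetical, and it assumes that each orphaned marker (a line holding only something like `2.`) belongs to exactly one of the text lines that follow the run of markers, in the same order. Items whose text wraps over several lines would not be handled correctly.

```python
import re

# Matches a line that holds only a list marker such as "2." or "10)".
MARKER_ONLY = re.compile(r"^\s*\d+[.)]\s*$")


def rejoin_list_markers(text: str) -> str:
    """Re-attach orphaned list markers to the text lines that follow them.

    Assumption: each marker pairs with exactly one following non-empty line.
    """
    lines = text.splitlines()
    result = []
    i = 0
    while i < len(lines):
        if not MARKER_ONLY.match(lines[i]):
            result.append(lines[i])
            i += 1
            continue
        # Collect the run of marker-only lines, skipping blank lines in between.
        markers = []
        while i < len(lines):
            if MARKER_ONLY.match(lines[i]):
                markers.append(lines[i].strip())
            elif lines[i].strip():
                break
            i += 1
        # Pair each collected marker with the next non-empty line, in order.
        for marker in markers:
            while i < len(lines) and not lines[i].strip():
                i += 1
            if i < len(lines):
                result.append(f"{marker} {lines[i].strip()}")
                i += 1
            else:
                result.append(marker)  # no text left to pair with
    return "\n".join(result)
```

Applied to the contents of the output file from the example, this would put the markers `2.`, `3.` and `4.` back in front of the three text lines that follow them.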

