[tesseract-ocr] Improve formatting of Bullet Points / Lists

Rarity Tue, 02 Jun 2015 05:40:19 -0700

Hello Tesseract-OCR community,

I am very happy with the recognition quality. However, when I OCR bullet 
points or numbered lists, the formatting of the output text file is wrong: 
the file first lists the item numbers on their own lines, and only then 
the item text.
This is not really an OCR issue, as all the text is recognized correctly, 
but I would like to know whether the formatting can be fixed as well.



I cannot share snippets of the documents where this went wrong, but I can 
show an example:


*Input file:*

Lorem ipsum dolor sit am.

   1. Ex vero phaedrum ius. 
   2. appareat patrioque mea. Has at alienum scaevola indoctum
   3.  No his modo quaerendum,
   4.  consul eruditi ex vim.
   

*Output file:*

Lorem ipsum dolor sit am.

1. Ex vero phaedrum ius. 
2.
3.
4.

appareat patrioque mea. Has at alienum scaevola indoctum
 No his modo quaerendum,
 consul eruditi ex vim.
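
One possible workaround is to post-process the text file rather than change anything in Tesseract. The sketch below is only a heuristic under stated assumptions: the function name `reattach_markers` is hypothetical, it assumes every orphaned marker looks like `2.` or `2)` on a line of its own, and it assumes each list item occupies exactly one line of text. It queues the orphaned markers and attaches them, in order, to the next non-empty lines.

```python
import re
from collections import deque

# A line that holds only a list marker such as "2." or "2)" (assumed format).
ORPHAN_MARKER = re.compile(r"^\s*(\d+[.)])\s*$")


def reattach_markers(text):
    """Pair orphaned list markers with the text lines that follow them."""
    pending = deque()  # markers still waiting for their text
    out = []
    for line in text.splitlines():
        match = ORPHAN_MARKER.match(line)
        if match:
            pending.append(match.group(1))
            continue
        stripped = line.strip()
        if pending:
            if not stripped:
                # Drop blank lines between the markers and their text.
                continue
            out.append(f"{pending.popleft()} {stripped}")
        else:
            out.append(line.rstrip())
    # Markers that never found any text are kept as they were.
    out.extend(pending)
    return "\n".join(out)
```

Applied to the output file above, this should give back `2. appareat patrioque mea. ...`, `3. No his modo quaerendum,` and `4. consul eruditi ex vim.` as separate items. It will mis-assign text whenever an item wraps over several lines, so it may also be worth experimenting with Tesseract's page segmentation mode (the `-psm` option in 3.x, `--psm` in later versions) to see whether the list is then read as a single block.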





*Bonus question:*
*Assuming Google Docs uses Tesseract-OCR, what setup and languages do they use?*
*The formatting of their output PDFs is gorgeous.*

