I have tried output to hOCR and TSV, but I still get all the text without hard returns or any separation between paragraphs.
Is there an hOCR tool that can export to Microsoft Word? The original document is a PDF. It's actually an official document. First I ran ImageMagick to get a cleaned TIFF file, and then I ran Tesseract on that, so I don't think it makes sense to convert the TIFF back to PDF. I simply need an export format from Tesseract that lets MS Word display the text properly, not as lines of code. Thanks!

On Thursday, 19 March 2020, 23:07:06 (UTC+1), zdenop wrote:
> Check out the hocr output (which is HTML), tsv, or pdf. See the docs.
>
> Zdenko
>
> On Thu, 19 Mar 2020 at 8:04, Dayton <[email protected]> wrote:
>
>> Hi All,
>>
>> I'm using Tesseract for Windows to OCR scanned documents and then format
>> the layout in Word at a later stage.
>>
>> The text extraction that I get in the .TXT output does not add any hard
>> returns or any separation between paragraphs, so I have to spend a lot of
>> time guessing where each line ends.
>>
>> Is there a parameter I can add to the command line to insert separations
>> between paragraphs?
>>
>> Should I use another output format instead of TXT to make the formatting
>> in Word easier?
>>
>> Thanks!
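For what it's worth, here is a minimal sketch of one way to get paragraph breaks back out of the hOCR output discussed in this thread. It assumes the hOCR file is well-formed XHTML in which paragraphs are marked with class `ocr_par` and words with class `ocrx_word`. The function names `hocr_paragraphs` and `hocr_to_text` are made up for illustration and are not part of Tesseract. The result can only be as good as the paragraph detection in the hOCR itself.

```python
import xml.etree.ElementTree as ET


def _has_class(element, name):
    """True if the element's class attribute contains the given class name."""
    return name in element.get("class", "").split()


def hocr_paragraphs(hocr_markup):
    """Return one string per paragraph found in hOCR markup (str or bytes).

    Assumption: paragraphs carry class "ocr_par" and words carry class
    "ocrx_word"; anything else in the file is ignored.
    """
    root = ET.fromstring(hocr_markup)
    paragraphs = []
    for par in root.iter():
        if not _has_class(par, "ocr_par"):
            continue
        # Collect the text of every word element inside this paragraph,
        # including text nested in inline tags such as <strong> or <em>.
        words = [
            "".join(word.itertext()).strip()
            for word in par.iter()
            if _has_class(word, "ocrx_word")
        ]
        words = [w for w in words if w]
        if words:  # skip paragraphs that contain no recognised words
            paragraphs.append(" ".join(words))
    return paragraphs


def hocr_to_text(hocr_markup):
    """Join the paragraphs with a blank line between them."""
    return "\n\n".join(hocr_paragraphs(hocr_markup)) + "\n"
```

As a usage example, with a hypothetical file name: read the hOCR file in binary mode with `open("output.hocr", "rb").read()`, pass the bytes to `hocr_to_text`, and save the returned string as a `.txt` file. Opening that file in Word should then show one paragraph per block, separated by blank lines, instead of the raw markup.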

