Take a look at gImageReader, which uses Tesseract. It has the options you are looking for.
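
If you would rather stay on the command line, the tsv output discussed in this thread already carries the paragraph information: each word row has `page_num`, `block_num` and `par_num` columns. Here is a minimal sketch in Python, assuming the file was produced with something like `tesseract input.tif output tsv`. The function name `tsv_to_paragraphs` is my own and not part of Tesseract, and Tesseract's idea of a paragraph may not always match the page layout.

```python
import csv
import io


def tsv_to_paragraphs(tsv_text):
    """Group the words of a Tesseract TSV dump into paragraph strings.

    Hypothetical helper (not a Tesseract API). It relies only on the
    standard TSV columns: level, page_num, block_num, par_num, text.
    """
    # QUOTE_NONE so that quote characters inside OCR text are kept literally.
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    paragraphs = {}  # dicts keep insertion order, so reading order is preserved
    for row in reader:
        # Level 5 rows are individual words; lower levels are layout containers.
        if row["level"] != "5":
            continue
        word = (row["text"] or "").strip()
        if not word:
            continue
        key = (int(row["page_num"]), int(row["block_num"]), int(row["par_num"]))
        paragraphs.setdefault(key, []).append(word)
    return [" ".join(words) for words in paragraphs.values()]
```

Joining the result with `"\n\n".join(...)` and saving it as a plain text file gives one blank line between paragraphs, which Word should then treat as paragraph breaks.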

On Fri, Mar 20, 2020, 17:55 Dayton <[email protected]> wrote:

> I have output to hocr and tsv, but I still get all the text without hard
> returns or any separation between paragraphs.
>
> Is there an hOCR tool that can export to Microsoft Word?
>
> The original document is in PDF format. It's actually an official
> document.
>
> First, I ran ImageMagick and got a cleaned TIFF file.
>
> After that, I ran Tesseract, so I think it does not make sense to
> convert the TIFF back to PDF again.
>
> I simply need an export format from Tesseract that allows MS Word to see
> the text properly, not with lines of code.
>
> Thanks!
>
> On Thursday, March 19, 2020, 23:07:06 (UTC+1), zdenop wrote:
>
>> Check out the output to hocr (which is HTML output), tsv or pdf. See the doc.
>>
>> Zdenko
>>
>> On Thu, Mar 19, 2020 at 8:04, Dayton <[email protected]> wrote:
>>
>>> Hi All,
>>>
>>> I'm using Tesseract for Windows to OCR scanned documents and then format
>>> the layout in Word at a later stage.
>>>
>>> The text extraction that I get in the .TXT output does not add any hard
>>> returns or any separation between paragraphs, so I have to spend a lot of
>>> time guessing where each line ends.
>>>
>>> Is there any way to add a parameter on the command line to add separations
>>> between paragraphs?
>>>
>>> Should I use another output format instead of TXT in order to make
>>> the formatting in Word easier?
>>>
>>> Thanks!
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/a6e27031-89a2-4800-a574-48f738b439a0%40googlegroups.com

