Thanks, shree. I'll have a look at gImageReader. It looks promising.
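In the meantime, here is a rough sketch of how the TSV output suggested further down the thread could be turned back into text with paragraph breaks. It assumes the TSV was produced with something like `tesseract input.tif output tsv` (the file names are placeholders) and that it has the usual columns `level`, `page_num`, `block_num`, `par_num`, `line_num`, `word_num`, `left`, `top`, `width`, `height`, `conf`, `text`. The function name `tsv_to_paragraphs` and the `SAMPLE` data are made up for illustration, and sorting by block number is only an approximation of reading order.

```python
import csv
import io


def tsv_to_paragraphs(tsv_text):
    """Rebuild line and paragraph breaks from Tesseract TSV output (hypothetical helper)."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    # Maps (page, block, paragraph) -> {line number: [words on that line]}.
    paragraphs = {}
    for row in reader:
        # Level 5 rows are words; levels 1-4 describe pages, blocks,
        # paragraphs and lines and carry no text of their own.
        if row["level"] != "5":
            continue
        word = (row["text"] or "").strip()
        if not word:
            continue
        par_key = (int(row["page_num"]), int(row["block_num"]), int(row["par_num"]))
        lines = paragraphs.setdefault(par_key, {})
        lines.setdefault(int(row["line_num"]), []).append(word)

    chunks = []
    for par_key in sorted(paragraphs):
        lines = paragraphs[par_key]
        # One output line per OCR line, words separated by single spaces.
        chunks.append("\n".join(" ".join(lines[n]) for n in sorted(lines)))
    # A blank line between paragraphs.
    return "\n\n".join(chunks)


# Made-up sample data in the assumed column layout, not real OCR output.
SAMPLE = "\n".join([
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\tleft\ttop\twidth\theight\tconf\ttext",
    "1\t1\t0\t0\t0\t0\t0\t0\t100\t100\t-1\t",
    "5\t1\t1\t1\t1\t1\t10\t10\t20\t10\t96\tHello",
    "5\t1\t1\t1\t1\t2\t35\t10\t20\t10\t95\tworld",
    "5\t1\t1\t1\t2\t1\t10\t25\t20\t10\t94\tagain",
    "5\t1\t1\t2\t1\t1\t10\t50\t20\t10\t93\tSecond",
    "5\t1\t1\t2\t1\t2\t35\t50\t30\t10\t92\tparagraph",
])

if __name__ == "__main__":
    print(tsv_to_paragraphs(SAMPLE))
```

Saving the result as a .txt file and opening it in Word should then show one paragraph per block of text. To get flowing paragraphs instead of the original line breaks, the lines inside each paragraph could be joined with a space rather than a newline.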

On Friday, March 20, 2020 at 13:27:22 (UTC+1), shree wrote:
> Take a look at gImageReader, which uses Tesseract. It has the options you
> are looking for.
>
> On Fri, Mar 20, 2020, 17:55 Dayton <[email protected]> wrote:
>
>> I have tried the hOCR and TSV outputs, but I still get all the text
>> without hard returns or any separation between paragraphs.
>>
>> Is there an hOCR tool that can export to Microsoft Word?
>>
>> The original document is in PDF format. It's actually an official
>> document.
>>
>> First, I ran ImageMagick and got a cleaned TIFF file.
>>
>> After that, I ran Tesseract, so I think it does not make sense to
>> convert the TIFF back to PDF.
>>
>> I simply need an export format from Tesseract that lets MS Word show
>> the text properly, not as lines of code.
>>
>> Thanks!
>>
>> On Thursday, March 19, 2020 at 23:07:06 (UTC+1), zdenop wrote:
>>
>>> Check out the hOCR output (which is HTML output), TSV, or PDF. See the
>>> docs.
>>>
>>> Zdenko
>>>
>>> On Thu, Mar 19, 2020 at 8:04, Dayton <[email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I'm using Tesseract for Windows to OCR scanned documents and then
>>>> format the layout in Word at a later stage.
>>>>
>>>> The text extraction that I get in the .TXT output does not add any hard
>>>> returns or any separation between paragraphs, so I have to spend a lot
>>>> of time guessing where each line ends.
>>>>
>>>> Is there a parameter I can add to the command line to insert
>>>> separations between paragraphs?
>>>>
>>>> Should I use another output format instead of TXT to make the
>>>> formatting in Word easier?
>>>>
>>>> Thanks!
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/tesseract-ocr/a6e27031-89a2-4800-a574-48f738b439a0%40googlegroups.com

