Take a look at gImageReader, which uses Tesseract. It has the options you
are looking for.
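
If you would rather stay with Tesseract's own output, here is a minimal sketch (not a finished tool) of how the TSV output could be regrouped into paragraphs. It assumes the usual TSV columns (level, page_num, block_num, par_num, line_num, word_num, ..., text), where level 5 rows are single words. The names `tsv_to_paragraphs` and `SAMPLE_TSV`, and the sample rows themselves, are made up for illustration.

```python
import csv
import io

HEADER = "\t".join([
    "level", "page_num", "block_num", "par_num", "line_num", "word_num",
    "left", "top", "width", "height", "conf", "text",
])


def _row(level, par, line, word, text):
    # Build one illustrative TSV row; coordinates are dummy values.
    fields = [level, 1, 1, par, line, word, 0, 0, 0, 0, 95 if text else -1, text]
    return "\t".join(str(f) for f in fields)


# Hypothetical sample standing in for a real .tsv file produced by Tesseract.
SAMPLE_TSV = "\n".join([
    HEADER,
    _row(3, 1, 0, 0, ""),        # paragraph row, no text
    _row(4, 1, 1, 0, ""),        # line row, no text
    _row(5, 1, 1, 1, "Hello"),
    _row(5, 1, 1, 2, "world"),
    _row(5, 1, 2, 1, "again"),
    _row(3, 2, 0, 0, ""),
    _row(5, 2, 1, 1, "Second"),
    _row(5, 2, 1, 2, "paragraph"),
])


def tsv_to_paragraphs(tsv_text):
    """Return one string per paragraph, in the order Tesseract emitted them."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    paragraphs = {}  # (page, block, paragraph) -> list of words
    for row in reader:
        text = (row.get("text") or "").strip()
        # Only level 5 rows carry words; skip structural rows and blanks.
        if row["level"] != "5" or not text:
            continue
        key = (row["page_num"], row["block_num"], row["par_num"])
        paragraphs.setdefault(key, []).append(text)
    # Join the words of each paragraph into a single line of text.
    return [" ".join(words) for words in paragraphs.values()]


if __name__ == "__main__":
    # A blank line between paragraphs gives Word a hard break to work with.
    print("\n\n".join(tsv_to_paragraphs(SAMPLE_TSV)))
```

For a real page, the TSV file can be produced with something like `tesseract page.tif page tsv` (the file names are placeholders), read in as text and passed to the function. The result can be saved as a plain .txt file and opened in Word. Words hyphenated across line ends are not rejoined in this sketch.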

On Fri, Mar 20, 2020, 17:55 Dayton <[email protected]> wrote:

> I have tried the hOCR and TSV outputs, but I still get all the text without
> hard returns or any separation between paragraphs.
>
> Is there an hOCR tool that can export to Microsoft Word?
>
> The original document is in PDF format. It's actually an official
> document.
>
> First, I ran ImageMagick and got a cleaned TIFF file.
>
> After that, I ran Tesseract, so I don't think it makes sense to convert
> the TIFF back to PDF.
>
> I simply need an export format from Tesseract that lets MS Word display
> the text properly, not as lines of code.
>
> Thanks!
>
> On Thursday, March 19, 2020, 23:07:06 (UTC+1), zdenop wrote:
>
>> Check out the hocr (which is HTML output), tsv, or pdf output formats.
>> See the documentation.
>>
>> Zdenko
>>
>>
>> On Thu, Mar 19, 2020 at 8:04, Dayton <[email protected]> wrote:
>>
>>> Hi All,
>>>
>>> I'm using Tesseract for Windows to OCR scanned documents and then format
>>> the layout in Word at a later stage.
>>>
>>> The text I get in the .TXT output does not contain any hard returns or
>>> any separation between paragraphs, so I have to spend a lot of time
>>> guessing where each line ends.
>>>
>>> Is there a command-line parameter that adds separations between
>>> paragraphs?
>>>
>>> Should I use another output format instead of TXT to make formatting
>>> in Word easier?
>>>
>>> Thanks!
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/a6e27031-89a2-4800-a574-48f738b439a0%40googlegroups.com
>>> <https://groups.google.com/d/msgid/tesseract-ocr/a6e27031-89a2-4800-a574-48f738b439a0%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>>

