Thanks, shree. I'll have a look at gImageReader. It looks promising.
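
A note on the paragraph question quoted below: Tesseract's TSV output carries a block, paragraph, and line number for every word, so paragraph breaks can be rebuilt from it. Here is a minimal sketch in Python, assuming the usual TSV column names (level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf, text) and that level 5 rows are words. The function name tsv_to_paragraphs, the helper _word, and the sample rows are made up for illustration; they are not real Tesseract output.

```python
import csv
import io


def tsv_to_paragraphs(tsv_text, line_sep="\n"):
    """Rebuild paragraph-separated text from Tesseract-style TSV.

    Words are grouped by (page_num, block_num, par_num), then by line_num.
    Paragraphs are separated by one blank line.
    """
    # QUOTE_NONE: OCR text may contain stray quote characters.
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    paragraphs = {}  # (page, block, par) -> {line_num: [words]}
    for row in reader:
        if row["level"] != "5":  # assumption: level 5 rows are words
            continue
        text = (row["text"] or "").strip()
        if not text:
            continue
        key = (int(row["page_num"]), int(row["block_num"]), int(row["par_num"]))
        lines = paragraphs.setdefault(key, {})
        lines.setdefault(int(row["line_num"]), []).append(text)

    chunks = []
    for key in sorted(paragraphs):
        lines = paragraphs[key]
        chunks.append(line_sep.join(" ".join(lines[n]) for n in sorted(lines)))
    return "\n\n".join(chunks)


def _word(par, line, word, text):
    # Hypothetical word-level row: page 1, block 1, dummy box and confidence.
    return ["5", "1", "1", str(par), str(line), str(word),
            "0", "0", "0", "0", "95", text]


HEADER = ["level", "page_num", "block_num", "par_num", "line_num", "word_num",
          "left", "top", "width", "height", "conf", "text"]
SAMPLE_ROWS = [
    HEADER,
    _word(1, 1, 1, "First"),
    _word(1, 1, 2, "paragraph,"),
    _word(1, 2, 1, "line"),
    _word(1, 2, 2, "two."),
    # A paragraph-level row (level 3) has no text and is skipped.
    ["3", "1", "1", "2", "0", "0", "0", "0", "0", "0", "-1", ""],
    _word(2, 1, 1, "Second"),
    _word(2, 1, 2, "paragraph."),
]
SAMPLE_TSV = "\n".join("\t".join(r) for r in SAMPLE_ROWS)

if __name__ == "__main__":
    print(tsv_to_paragraphs(SAMPLE_TSV))
```

With a real file, reading the TSV as UTF-8 text and passing it to tsv_to_paragraphs should give text that pastes into Word with one blank line per paragraph. Passing line_sep=" " joins the lines of each paragraph so Word can reflow them.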


On Friday, 20 March 2020 at 13:27:22 (UTC+1), shree wrote:

> Take a look at gImageReader, which uses Tesseract. It has the options you 
> are looking for.
>
> On Fri, Mar 20, 2020, 17:55 Dayton <[email protected]> wrote:
>
>> I have tried the hOCR and TSV outputs, but I still get all the text without 
>> hard returns or any separation between paragraphs.
>>
>> Is there an hOCR tool that can export to Microsoft Word?
>>
>> The original document is in PDF format. It's actually an official 
>> document. 
>>
>> First, I ran ImageMagick and got a cleaned TIFF file. 
>>
>> After that, I ran Tesseract, so I don't think it makes sense to convert 
>> the TIFF back to PDF. 
>>
>> I simply need an export format from Tesseract that lets MS Word show 
>> the text properly, not as lines of code.
>>
>> Thanks!
>>
>> On Thursday, 19 March 2020 at 23:07:06 (UTC+1), zdenop wrote:
>>
>>> Check out the hocr (HTML-based), tsv, or pdf output formats. See the docs.
>>>
>>> Zdenko
>>>
>>>
>>> On Thu, 19 Mar 2020 at 8:04, Dayton <[email protected]> wrote:
>>>
>>>> Hi All,
>>>>
>>>> I'm using Tesseract for Windows to OCR scanned documents, and I then 
>>>> format the layout in Word at a later stage.
>>>>
>>>> The text I get in the .TXT output has no hard returns or any other 
>>>> separation between paragraphs, so I have to spend a lot of time 
>>>> guessing where each line ends. 
>>>>
>>>> Is there a command-line parameter that adds separations 
>>>> between paragraphs?
>>>>
>>>> Should I use an output format other than TXT to make 
>>>> formatting in Word easier?
>>>>
>>>> Thanks!
>>>>
>>>> -- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to [email protected].
>>>> To view this discussion on the web visit 
>>>> https://groups.google.com/d/msgid/tesseract-ocr/a6e27031-89a2-4800-a574-48f738b439a0%40googlegroups.com
>>>>  
>>>> <https://groups.google.com/d/msgid/tesseract-ocr/a6e27031-89a2-4800-a574-48f738b439a0%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>> .
>>>>
>
