Re: [tesseract-ocr] approaches used for language detection on images ...

Albretch Mueller Tue, 04 Feb 2020 04:08:56 -0800

On 2/1/20, Zdenko Podobny <[email protected]> wrote:
> You did not provide any example Image


 OK, this one will do. The images in this PDF file vary in quality and
have text embedded in various ways. This is typical of the text I deal
with:

 https://www.nysedregents.org/USHistoryGov/Archive/20020122exam.pdf

 Another example of a text file I work with is:

 https://scholarworks.iu.edu/dspace/bitstream/handle/2022/18961/Notes
and Texts.pdf

 On that file pdftohtml produces one background file per page, but
when you stratify the content (simply using hash signatures) you
realize that most files are of the same kind: blank background images,
or files containing a single line (for example, one underlining a
title or framing a blocked message). Then there are full-page blank
images with segments of Greek text, ...
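
 A minimal sketch of that stratification step, assuming the background
images have already been written to a single directory. The function
names, the "*.png" pattern and the choice of SHA-256 are illustrative
assumptions, not something pdftohtml prescribes:

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def group_by_digest(directory, pattern="*.png"):
    """Map each digest to the sorted names of the files that share it.

    The pattern is an assumption; adjust it to whatever image type the
    extraction step actually produced.
    """
    groups = defaultdict(list)
    for path in sorted(Path(directory).glob(pattern)):
        if path.is_file():
            groups[file_digest(path)].append(path.name)
    return dict(groups)
```

 Digests shared by many files point at the repeated blank or
single-line backgrounds; digests that occur only once are the pages
worth passing on to OCR. Note that this only catches byte-identical
files, not images that merely look alike.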

 I don't quite understand why the poppler utils don't just underline a
word. Of course, you could easily write some code to figure out which
segments of text should be underlined, but understanding the obvious
tends to pay off in the long run.

> , neither what kind of tools you would
> like to use (open source or proprietary)

 poppler's pdftohtml tools:

 https://poppler.freedesktop.org/

 are pretty good, but there is always some extra tweaking you need to
do. Authors write texts in whichever way they want, and this is a good
thing.

>    4. I guess you will have problem with texts with mixed languages.

 Yes, I do, but a few heuristics based on metadata (extracted from
the names and/or headings of files) are of great help.
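
 As an illustration of such a heuristic, here is a small sketch that
guesses a Tesseract language string from words found in a file name.
The keyword table and the function name are hypothetical; the codes
(eng, ell, deu, fra, spa) are Tesseract's usual traineddata names, and
the "+" separator is the form accepted by its -l option:

```python
import re

# Hypothetical keyword table: word found in a file name -> Tesseract code.
LANGUAGE_HINTS = {
    "english": "eng",
    "greek": "ell",
    "german": "deu",
    "french": "fra",
    "spanish": "spa",
}


def guess_languages(file_name, default="eng"):
    """Build a 'lang1+lang2' string from keywords in file_name.

    Falls back to `default` when no known keyword is present.
    """
    tokens = re.findall(r"[a-z]+", file_name.lower())
    langs = []
    for token in tokens:
        code = LANGUAGE_HINTS.get(token)
        if code and code not in langs:
            langs.append(code)  # keep first-seen order, no duplicates
    return "+".join(langs) if langs else default
```

 The result is only a starting guess; pages with mixed languages still
need to be checked by a person.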

 At the end of the day you can't fully automate such a process. You
will need a GUI and to let "users" eyeball the data ...

>    5. If  proprietary tools (and budget ;-) ) are not problem you can try
>    to use  google vision [6] or Microsoft cognitive services [7] or Amazon
>    Rekognition. Dataturks made some test for them [9]...

 I am trying to write up a set of bash scripts and code as part of a
pretty complete all-purpose library. Ideally the back-end text will be
formatted as ODT, since it is very easy to convert to any other
format anyway.

 Do you know of such a library?

> [1] ... [9]

 Thank you,
 lbrtchx

