First, let me say that I am virtually brand-new to OCR. It is only within
the past couple of weeks that I have developed an appreciation that today's
OCR is much much more than "character" recognition. It can probably be
better described as "message" recognition, where all the nuances of
language, words, dictionaries, and syntax/grammar are as important as
individual characters. Still, the OCR application I am tackling is really
just concerned with recognizing, as accurately as possible, strings of
characters where there is no correlation between characters in a string or
from one string to the next. Maybe attempting to use today's OCR for this
application is analogous to using a jack-hammer to drive a tack. If so,
please let me know, and if you can point me towards a technology other than
OCR I'd appreciate it.
I apologize for the length of this post. I thought it best to describe as
fully as I can the application driving my interest in Tesseract, and also
what simple OCR experiments I've done to try to assess what should be
possible to achieve with Tesseract.
First, the application. This is a real-world application that probably
comes up in many contexts. I will not describe the specifics of the
application, but will describe it notionally as follows. In an enterprise
there are two sites -- A and B. Information can be produced at site A and
needs to be utilized at site B. However, there is no means for
transferring the information, electronically, from A to B. The information
is in electronic form at site A, and to be utilized efficiently needs to be
in electronic form at site B. To make things simple, assume that at site A
the information exists in tabular form as a spreadsheet, and it needs to be
re-entered into that tabular/spreadsheet form at site B. A current process
exists: the spreadsheet is printed at site A, the printout is physically
transferred to site B, and there it is scanned to a PDF. Site B
has the Adobe Acrobat Pro tool, enabling the scanned PDF to be subjected to
OCR and the result saved in a variety of formats, one being an Excel file.
However, there are numerous errors in this process, requiring painstaking
and time-consuming editing of the Excel file, and there is no good way to
be assured that all errors have been found and corrected.
It can be further assumed that in this application the information in the
spreadsheet at site A, i.e., the contents of the spreadsheet cells,
consists of a very restricted character set -- generally, the characters
are upper case letters A-Z, digits 0-9, the decimal point, and possibly the
comma. Moreover, in printing the spreadsheet at site A, various things are
under control, such as the font used, the font size, and the printing
quality. And, at site B, the scan resolution is under control. Clearly,
it should be possible to configure these "control variables" so that the
hard-copy printout, once scanned, is as "OCR-able as possible" with the
smallest probability of character errors. This seems "obvious" to me.
However, testing this out is more difficult than I ever expected. Here
are some observations, and what I've done so far:
- I have done some research on what fonts are recognizable with the best
accuracy, and have narrowed down to a set of about 6. The OCR A Extended
font, a default font on Windows systems, seems likely to be best in this
regard. It will be possible to print a quality hard-copy at site A in any
of these fonts, and with a font size large enough to allow inter-character
discrimination, and it will be possible to scan-to-PDF at site B at 600
dpi, which should be as good as needed.
- Unfortunately, the OCR performance of Adobe Acrobat Pro on the scanned
PDF is worst with the OCR A Extended font. This is because there are no
means for configuring the Adobe Acrobat OCR engine -- it is probably
pre-configured to expect scanned material in a variety of more conventional
fonts -- and OCR A Extended is probably a font in which few materials for
actual human reading are printed (it's not a very "pleasing" font). If it
were possible to configure Adobe Acrobat to expect a particular font, and
even better a restricted alphabet of characters, that might be the ideal
tool to use, since it's already at site B.
- In researching OCR software, I have come to the opinion that Adobe Acrobat
Pro and also ABBYY FineReader are held in the highest regard. They
generally lead the pack in reviews of commercial OCR. So, I have also
tested out, on a trial basis, ABBYY FineReader. Here is what I have found
with ABBYY:
- It is possible to "train" ABBYY to expect input in a specific font
and in a restricted character alphabet (in my case, A-Z, 0-9, decimal
point, and comma). It's not as simple as just specifying, say, OCR A
Extended, and a set of characters. Rather, material with the alphabet
and in the chosen font must be read in, and ABBYY can be made to map each
character shape in the input to a given character. Except, see 2 bullets
below ...
- When I trained ABBYY to the restricted character alphabet, in the
OCR A Extended font, I was able to successfully complete the training --
i.e. each character "shape" in the training input could be mapped to the
appropriate character. After doing that, the "trained" ABBYY performed
as well as could be expected in correctly recognizing two pages of scanned
material (about 2000 random characters), printed in the OCR A Extended
font. There were no errors at all. I was not able, due to restrictions
in the ABBYY evaluation trial, to test the recognition performance with
more than two pages, but it appears that ABBYY and the OCR A Extended font
could produce the recognition accuracy I'm looking for in my application.
However, there is the expense of acquiring ABBYY at site B (actually,
there are many site B's), as well as other issues I won't get into.
- When I attempted to train ABBYY in one of the other fonts, I found
that in the training, after having mapped a number of character shapes to
correct characters, inevitably there would be a next character shape, not
yet mapped, that ABBYY considered to be already mapped, on the basis of
similarity to shapes that had been mapped. As an example, ABBYY might
allow correct mapping of the font shapes for 0-9 and A to the characters
0-9 and A, but when the font shape for B occurred, ABBYY considered this
already mapped to the character 8, and there was no way to cancel that
default mapping. Thus, for all these other fonts, it was not possible to
correctly train ABBYY to recognize all the character shapes.
- The last point above is partial confirmation that the OCR A
Extended font is probably the best for OCR recognizability, on a
per-character basis.
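The two-page experiment described above (about 2000 random characters from the restricted alphabet) can be repeated for any OCR engine with a small script. What follows is only a minimal sketch, not the procedure actually used in the ABBYY trial: the function names are made up for illustration, and the comparison is a simple position-by-position check that assumes the OCR output keeps the same line structure as the printed page. A single dropped or inserted character shifts the rest of its line and is over-counted, so a proper edit-distance measure would be more forgiving.

```python
import random
import string

# Restricted alphabet from the application: A-Z and 0-9.
ALPHABET = string.ascii_uppercase + string.digits


def make_test_lines(n_lines, line_len, seed=0):
    """Return n_lines random strings of length line_len (hypothetical helper).

    A fixed seed makes the page reproducible, so the same ground truth
    can be printed in several fonts and compared later.
    """
    rng = random.Random(seed)
    return [
        "".join(rng.choice(ALPHABET) for _ in range(line_len))
        for _ in range(n_lines)
    ]


def count_char_errors(expected_lines, recognized_lines):
    """Compare OCR output to ground truth, position by position.

    Returns (errors, total_expected_characters). Missing lines and
    length differences within a line are counted as errors.
    """
    errors = 0
    total = 0
    for i, expected in enumerate(expected_lines):
        recognized = recognized_lines[i] if i < len(recognized_lines) else ""
        total += len(expected)
        # Mismatches in the overlapping part of the two strings.
        errors += sum(1 for a, b in zip(expected, recognized) if a != b)
        # Characters missing from, or extra in, the recognized line.
        errors += abs(len(expected) - len(recognized))
    return errors, total


if __name__ == "__main__":
    page = make_test_lines(n_lines=50, line_len=40, seed=1)  # 2000 characters
    print("\n".join(page))
```

The printed lines would be pasted into the spreadsheet at site A, and the text recovered at site B would be passed to count_char_errors together with the same seeded ground truth.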
So, now this FINALLY gets down to my interest in Tesseract. (Sorry for all
the verbiage above, but it helps put into context what I'd like to do with
Tesseract.) I'm hoping that, as far as ability to recognize individual
characters correctly, Tesseract should be as good as the leading commercial
OCR engines -- Adobe Acrobat Pro and ABBYY FineReader. Also, from
everything I've dug into about Tesseract, it appears that it should be
possible to "train" or "configure" Tesseract to expect input consisting of
- Characters only in a single well-defined, specific font. Currently I
am focusing on OCR A Extended, due to the positive character
recognizability results I was able to produce in my experiments with ABBYY.
- Characters only in a specific subset -- currently I would want to
assume just 0-9 and A-Z.
- Completely "random" characters, meaning no predetermined dictionary of
words, no language, no aspects of the input that would make certain
character strings more likely to be correct than others.
I know it's possible to "train" Tesseract, and probably that has to be done
with the OCR A Extended font. But the documentation on how to do that is
extremely difficult to follow, and frankly I don't have a lot of time to
spend on becoming a relative Tesseract expert. It would be so much
better/easier (for my purposes) if, when it is known a priori that the
image input is all in a specific font, that fact could simply be conveyed
to Tesseract on the command line. Possibly someone else has already done the
training to the OCR A Extended font, and I could reuse what was done there?
I also know that it's possible to specify "white-lists" and "black-lists"
of characters, and that this should allow me to instruct Tesseract that
the characters are known to be 0-9 or A-Z. However, I have yet to find a
good description of the actual mechanics of setting everything up and
properly configuring the command line to effect this definition of a
restricted character set.
I'm more than a little "bothered" by the role that "language" plays in
Tesseract, since in my application there are completely random character
strings. Is there a way to specify "no language" in the execution of
Tesseract, meaning no a priori set of acceptable words, nothing else that
might cause Tesseract to choose one character string over another aside
from simply the best choice on an individual character-by-character basis?
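For concreteness, the kind of invocation these two questions point at might look like the sketch below. It only builds the argument list and does not run anything. The assumptions are: Tesseract 4-style option syntax ("--psm"; older 3.x releases spell it "-psm"), the configuration variables tessedit_char_whitelist, load_system_dawg and load_freq_dawg passed with "-c", and page segmentation mode 6 (a single uniform block of text) as a starting point for a table printout. The function name, file names and the default "eng" language are placeholders; a traineddata file for OCR A Extended, if one exists, would be named in its place. Whether the whitelist and the dictionary switches actually take effect may depend on the Tesseract version and recognition engine, so treat this as something to test rather than as a confirmed recipe.

```python
import string

# Restricted character set from the application: 0-9 and A-Z.
WHITELIST = string.digits + string.ascii_uppercase


def build_tesseract_command(image_path, output_base, lang="eng", psm=6):
    """Build (but do not run) a tesseract command line.

    image_path, output_base and lang are placeholders chosen for this
    sketch; lang would be replaced by the name of a traineddata file
    for the chosen font if one is available.
    """
    return [
        "tesseract", image_path, output_base,
        "-l", lang,
        "--psm", str(psm),
        # Restrict recognition to the known alphabet.
        "-c", "tessedit_char_whitelist=" + WHITELIST,
        # Ask Tesseract not to load its word dictionaries, so that
        # random strings are not nudged towards dictionary words.
        "-c", "load_system_dawg=0",
        "-c", "load_freq_dawg=0",
    ]


if __name__ == "__main__":
    print(" ".join(build_tesseract_command("page.png", "page_out")))
```

Once the command looks right it could be run from a shell, or from Python by passing the list to subprocess.run.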
The above describes what motivates what I'm trying to do with Tesseract.
I'm hoping I can stimulate getting, from all of you Tesseract experts,
some of the following:
- Anything that's already been done that I could reuse. Specifically,
if Tesseract has already been trained to expect inputs in the OCR A
Extended font, having the results of that, and directions on how to utilize
it, would be ideal.
- Better documentation that allows me to more fully understand
everything about the Tesseract execution environment (e.g., config files,
how they are constructed, where they reside, etc.) and also all the
specific little options in the command line that I might need to know about.
Again, sorry about how long this turned out to be. It should be possible
for subsequent posts to be much shorter and to the point of specific issues.
Think of this post as providing background and definition of what I want
to accomplish with Tesseract.
Thanks.