First, let me say that I am virtually brand-new to OCR.  It is only within 
the past couple of weeks that I have developed an appreciation that today's 
OCR is much, much more than "character" recognition.  It can probably be 
better described as "message" recognition, where all the nuances of 
language, words, dictionaries, and syntax/grammar are as important as 
individual characters.  Still, the OCR application I am tackling is really 
just concerned with recognizing, as accurately as possible, strings of 
characters where there is no correlation between characters in a string or 
from one string to the next.  Maybe attempting to use today's OCR for this 
application is analogous to using a jack-hammer to drive a tack.  If so, 
please let me know, and if you can point me towards a technology other than 
OCR I'd appreciate it.

I apologize for the length of this post.  I thought it best to describe as 
fully as I can the application driving my interest in Tesseract, and also 
what simple OCR experiments I've done to try to assess what should be 
possible to achieve with Tesseract.

First, the application.  This is a real-world application that probably 
comes up in many contexts.  I will not describe the specifics of the 
application, but will describe it notionally as follows.  In an enterprise 
there are two sites -- A and B.  Information can be produced at site A and 
needs to be utilized at site B.  However, there is no means for 
transferring the information, electronically, from A to B.  The information 
is in electronic form at site A, and to be utilized efficiently needs to be 
in electronic form at site B.  To make things simple, assume that at site A 
the information exists in tabular form as a spreadsheet, and it needs to be 
re-entered into that tabular/spreadsheet form at site B.  A current process 
exists, consisting of printing the spreadsheet at site A, physically 
transferring the printout to site B, where it is scanned to a PDF.  Site B 
has the Adobe Acrobat Pro tool, enabling the scanned PDF to be subjected to 
OCR and the result saved in a variety of formats, one being an Excel file. 
 However, there are numerous errors in this process, requiring painstaking 
and time-consuming editing of the Excel file, and there is no good way to 
be assured that all errors have been found and corrected.

It can be further assumed that in this application the information in the 
spreadsheet at site A, i.e., the contents of the spreadsheet cells, 
consists of a very restricted character set -- generally, the characters 
are upper case letters A-Z, digits 0-9, the decimal point, and possibly the 
comma.  Moreover, in printing the spreadsheet at site A, various things are 
under control, such as the font used, the font size, and the printing 
quality.  And, at site B, the scan resolution is under control.  Clearly, 
it should be possible to configure these "control variables" so that the 
hard-copy printout, once scanned, is as "OCR-able as possible" with the 
smallest probability of character errors.  This seems "obvious" to me. 
 However, testing this out is more difficult than I ever expected.  Here 
are some observations, and what I've done so far:

   - I have done some research on what fonts are recognizable with the best 
   accuracy, and have narrowed down to a set of about 6.  The OCR A Extended 
   font, a default font on Windows systems, seems likely to be best in this 
   regard.  It will be possible to print a quality hard-copy at site A in any 
   of these fonts, and with a font size large enough to allow inter-character 
   discrimination, and it will be possible to scan-to-PDF at site B at 600 
   dpi, which should be as good as needed.
   - Unfortunately, the OCR performance of Adobe Acrobat Pro on the scanned 
   PDF is worst with the OCR A Extended font.  This is because there are no 
   means for configuring the Adobe Acrobat OCR engine -- it is probably 
   pre-configured to expect scanned material in a variety of more conventional 
   fonts -- and OCR A Extended is probably a font in which few materials for 
   actual human reading are printed (it's not a very "pleasing" font).  If it 
   were possible to configure Adobe Acrobat to expect a particular font, and 
   even better a restricted alphabet of characters, that might be the ideal 
   tool to use, since it's already at site B.
   - In researching OCR SW, I have come to the opinion that Adobe Acrobat 
   Pro and also ABBYY FineReader are held in the highest regard.  They 
   generally lead the pack in reviews of commercial OCR.  So, I have also 
   tested out, on a trial basis, ABBYY FineReader.  Here is what I have found 
   with ABBYY:
      - It is possible to "train" ABBYY to expect input in a specific font 
      and in a restricted character alphabet (in my case, A-Z, 0-9, decimal 
      point, and comma).  It's not as simple as just specifying, say, OCR A 
      Extended, and a set of characters.  Rather, material with the alphabet 
      and in the chosen font must be read in, and ABBYY can be made to map 
      each character shape in the input to a given character.  Except, see 2 
      bullets below ...
      - When I trained ABBYY to the restricted character alphabet, in the 
      OCR A Extended font, I was able to successfully complete the training -- 
      i.e. each character "shape" in the training input could be mapped to the 
      appropriate character.  After doing that, the "trained" ABBYY 
      performed as well as could be expected in correctly recognizing two 
      pages of scanned material (about 2000 random characters), printed in 
      the OCR A Extended font.  There were no errors at all.  I was not 
      able, due to restrictions in the ABBYY evaluation trial, to test the 
      recognition performance with more than two pages, but it appears that 
      ABBYY and the OCR A Extended font could produce the recognition 
      accuracy I'm looking for in my application. 
       However, there is the expense of acquiring ABBYY at site B (actually, 
      there are many site B's), as well as other issues I won't get into.
      - When I attempted to train ABBYY in one of the other fonts, I found 
      that in the training, after having mapped a number of character shapes to 
      correct characters, inevitably there would be a next character shape, not 
      yet mapped, that ABBYY considered to be already mapped, on the basis of 
      similarity to shapes that had been mapped.  As an example, ABBYY might 
      allow correct mapping of the font shapes for 0-9 and A to the characters 
      0-9 and A, but when the font shape for B occurred, ABBYY considered this 
      already mapped to the character 8, and there was no way to cancel that 
      default mapping.  Thus, for all these other fonts, it was not possible to 
      correctly train ABBYY to recognize all the character shapes.
      - The last point above is partial confirmation that the OCR A 
      Extended font is probably the best for OCR recognizability, on a 
      per-character basis.
   
So, now this FINALLY gets down to my interest in Tesseract.  (Sorry for all 
the verbiage above, but it helps put into context what I'd like to do with 
Tesseract.)  I'm hoping that, as far as ability to recognize individual 
characters correctly, Tesseract should be as good as the leading commercial 
OCR engines -- Adobe Acrobat Pro and ABBYY FineReader.  Also, from 
everything I've dug into about Tesseract, it appears that it should be 
possible to "train" or "configure" Tesseract to expect input consisting of

   - Characters only in a single well-defined, specific font.  Currently I 
   am focusing on OCR A Extended, due to the positive character 
   recognizability results I was able to produce in my experiments with ABBYY.
   - Characters only in a specific subset -- currently I would want to 
   assume just 0-9 and A-Z.
   - Completely "random" characters, meaning no predetermined dictionary of 
   words, no language, no aspects of the input that would make certain 
   character strings more likely to be correct than others.

I know it's possible to "train" Tesseract, and probably that has to be done 
with the OCR A Extended font.  But the documentation on how to do that is 
extremely difficult to follow, and frankly I don't have a lot of time to 
spend on becoming a relative Tesseract expert.  It would be so much 
better/easier (for my purposes) if, when it is known a priori that the 
image input is all in a specific font, to be able to convey that fact to 
Tesseract on the command line.  Possibly someone else has already done the 
training to the OCR A Extended font, and I could reuse what was done there?
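
From what I've read so far, pointing Tesseract at an already-trained data 
file would look something like the sketch below.  To be clear, this is 
only my guess at the mechanics, not something I've tested: 
"ocra.traineddata" is a hypothetical file name for someone's published 
OCR A Extended training results, and "scan.tif" stands in for my scanned 
page.

```shell
# Sketch only: ocra.traineddata is a hypothetical name for existing
# OCR-A training results; scan.tif / out are placeholder file names.
mkdir -p ./tessdata
cp /path/to/ocra.traineddata ./tessdata/

# The trained data is selected as the "language" with -l.  Newer
# Tesseract versions accept --tessdata-dir to say where it lives;
# older versions use the TESSDATA_PREFIX environment variable instead.
tesseract scan.tif out --tessdata-dir ./tessdata -l ocra
```

If that's roughly right, then all I'd need is for someone to share the 
trained-data file itself.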

I also know that it's possible to specify "white-lists" and "black-lists" 
of characters, and that this should let me instruct Tesseract that the 
characters are known to be 0-9 or A-Z.  However, I have 
yet to find the best description of the actual mechanics of setting 
everything up and properly configuring the command line to effect this 
definition of a restricted character set.
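
Piecing together what I've seen mentioned, I imagine the mechanics are 
something like the following CLI/config fragments (again untested on my 
part; "scan.tif" and "out" are placeholders, and tessedit_char_whitelist 
is the variable name I keep seeing referenced).  Confirmation or 
correction would be very welcome.

```shell
# Restrict recognition to upper-case letters, digits, decimal point,
# and comma, directly on the command line with -c var=value:
tesseract scan.tif out -c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.,

# Or, as I understand it, the same variable can go in a config file
# that is passed as a trailing argument:
echo 'tessedit_char_whitelist ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.,' > restricted
tesseract scan.tif out restricted
```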

I'm more than a little "bothered" by the role that "language" plays in 
Tesseract, since in my application there are completely random character 
strings.  Is there a way to specify "no language" in the execution of 
Tesseract, meaning no a priori set of acceptable words, nothing else that 
might cause Tesseract to choose one character string over another aside 
from simply the best choice on an individual, character-by-character basis?
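
For what it's worth, the knobs I've seen referenced for this are the 
dictionary-loading variables, so I imagine the invocation would look 
roughly like the fragment below.  I don't know whether these two 
variables actually disable every language/context effect, which is 
really the heart of my question.

```shell
# Turn off the word dictionaries so no lexicon biases the output
# (my understanding of these variables; corrections welcome):
#   load_system_dawg  -- main word dictionary
#   load_freq_dawg    -- frequent-words dictionary
tesseract scan.tif out \
  -c load_system_dawg=0 \
  -c load_freq_dawg=0
```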

The above describes what motivates what I'm trying to do with Tesseract. 
 I'm hoping to get, from all of you Tesseract experts, 
some of the following:

   - Anything that's already been done that I could reuse.  Specifically, 
   if Tesseract has already been trained to expect inputs in the OCR A 
   Extended font, having the results of that, and directions on how to utilize 
   it, would be ideal.
   - Better documentation that allows me to more fully understand 
   everything about the Tesseract execution environment (e.g., config files, 
   how they are constructed, where they reside, etc.) and also all the 
   specific little options in the command line that I might need to know about.

Again, sorry about how long this turned out to be.  It should be possible 
for subsequent posts to be much shorter and focused on specific issues. 
 Think of this post as providing background and definition of what I want 
to accomplish with Tesseract.

Thanks.
