Help port to Java on Linux: Tesseract only returning ~

James Le Cuirot Tue, 01 Dec 2009 17:49:54 -0800

Hi guys,

Some of you may know of the Tesjeract project that brings Tesseract to
Java on Windows. I'm interested in running Audiveris (which uses
Tesjeract) on Linux so I've been trying to port Tesjeract to Linux. It
uses tessdll.dll, which initially posed a problem because there is no
direct equivalent for Linux.


I eventually managed to create a Linux equivalent version. To keep
things simple, it only contains the minimum needed by Tesjeract. In
order to build it, I had to build Tesseract itself as shared
libraries, something that hasn't been done on Linux up till now. See
issue 174 about that but I don't think it's the cause of the problem
I'm about to mention.

I now have Audiveris successfully calling Tesseract through Tesjeract
without crashing. The problem is that only ~ is being returned. I
understand that this is what happens when a glyph isn't recognised?
I've checked the files being fed to Tesseract and they are indeed
uncompressed 8-bit TIFFs.
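
For anyone who wants to double-check the same thing from Java, here is a minimal sketch that reads the relevant tags from the first image file directory of a TIFF. The class name TiffHeaderCheck and its methods are hypothetical and not part of Tesjeract or Audiveris. It assumes a baseline, single-channel TIFF whose tag values are stored inline (count 1), which is what an 8-bit greyscale glyph image should look like.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Hypothetical helper for inspecting a TIFF header; not part of Tesjeract. */
public class TiffHeaderCheck {

    /** Values read from the first IFD, initialised to the TIFF 6.0 defaults. */
    public static final class Info {
        public long width = -1;          // -1 means the tag was absent
        public long height = -1;
        public int bitsPerSample = 1;    // tag 258, default 1
        public int compression = 1;      // tag 259, 1 = no compression
        public int samplesPerPixel = 1;  // tag 277, default 1

        public boolean isUncompressed8BitGray() {
            return compression == 1 && bitsPerSample == 8 && samplesPerPixel == 1;
        }
    }

    public static Info read(byte[] data) {
        if (data.length < 8) {
            throw new IllegalArgumentException("Too short to be a TIFF");
        }
        ByteBuffer buf = ByteBuffer.wrap(data);
        // Byte-order mark: "II" is little-endian, "MM" is big-endian.
        if (data[0] == 'I' && data[1] == 'I') {
            buf.order(ByteOrder.LITTLE_ENDIAN);
        } else if (data[0] == 'M' && data[1] == 'M') {
            buf.order(ByteOrder.BIG_ENDIAN);
        } else {
            throw new IllegalArgumentException("Missing TIFF byte-order mark");
        }
        if (buf.getShort(2) != 42) {
            throw new IllegalArgumentException("Missing TIFF magic number 42");
        }
        int ifd = buf.getInt(4);                   // offset of the first IFD
        int entries = buf.getShort(ifd) & 0xFFFF;  // number of 12-byte entries
        Info info = new Info();
        for (int i = 0; i < entries; i++) {
            int pos = ifd + 2 + 12 * i;
            int tag = buf.getShort(pos) & 0xFFFF;
            int type = buf.getShort(pos + 2) & 0xFFFF;
            // Assumption: count is 1, so the value is stored inline.
            // SHORT (type 3) uses the first two bytes, LONG (type 4) all four.
            long value = (type == 3)
                    ? (buf.getShort(pos + 8) & 0xFFFF)
                    : (buf.getInt(pos + 8) & 0xFFFFFFFFL);
            switch (tag) {
                case 256: info.width = value; break;
                case 257: info.height = value; break;
                case 258: info.bitsPerSample = (int) value; break;
                case 259: info.compression = (int) value; break;
                case 277: info.samplesPerPixel = (int) value; break;
                default: break;
            }
        }
        return info;
    }

    public static void main(String[] args) throws IOException {
        Info info = read(Files.readAllBytes(Paths.get(args[0])));
        System.out.println("size=(" + info.width + "," + info.height + ")"
                + " bitsPerSample=" + info.bitsPerSample
                + " compression=" + info.compression
                + " uncompressed8BitGray=" + info.isUncompressed8BitGray());
    }
}
```

Running it with the path of a TIFF as the only argument prints the size, bits per sample, and compression code. This only inspects the file on disk; it says nothing about how the pixel buffer is handed to Tesseract afterwards.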

Here's a sample:
http://groups.google.com/group/tesseract-ocr/web/tesjeract-linux.tiff

Tesseract correctly reports the image information. Here's some of the
output I get:
[java] Image has 8 * 1 bits per pixel, and size (83,24)
[java] Resolution=1
[java]  omr.glyph.text.Sentence.recognize(Sentence.java:636) -- INFO: Glyph#763 (eng)->"~"
[java] Image has 8 * 1 bits per pixel, and size (80,24)
[java] Resolution=1
[java]  omr.glyph.text.Sentence.recognize(Sentence.java:636) -- INFO: Glyph#768 (eng)->"~"
[java] Image has 8 * 1 bits per pixel, and size (50,23)
[java] Resolution=1
[java]  omr.glyph.text.Sentence.recognize(Sentence.java:636) -- INFO: Glyph#438 (eng)->"~"
[java] Image has 8 * 1 bits per pixel, and size (617,27)
[java] Resolution=1
[java]  omr.glyph.text.Sentence.recognize(Sentence.java:636) -- INFO: Glyph#765 (eng)->"~"

I also fed the sample to Tesseract directly and it correctly
recognised it as "COUNTRY" so it seems there's nothing wrong with my
installation. I did see a mailing list thread about the DLL returning
~ when the application didn't but I got the impression that this
problem was fixed. Since I've only mimicked existing code, I don't
have a deep understanding of how Tesseract works, so I'd appreciate it if
someone could take a look and see if there's anything obviously wrong.
Someone familiar with the DLL and/or Tesjeract should be able to
follow this code quite easily.

http://groups.google.com/group/tesseract-ocr/web/tesjeract-linux.zip

If you want to try and build it, let me know and I'll post some
instructions. Note that I'm using the latest Tesseract code from SVN.
