Re: [tesseract-ocr] Many 'question mark' chars in recognized text

2014-10-17 Thread Salvo Piazza

Hi Zdenko,
thanks for your response.

I know tesseract only at a beginner level, so can you tell me how I can
check it? (I use a Linux version of tesseract...)

Thanks,
Salvo.

On Thursday, 16 October 2014 21:46:31 UTC+2, zdenop wrote:

fl is recognized as a ligature in English, so there could be the same issue
in Italian. If it is replaced with '?', I would guess you have a problem with
Unicode... Can you check it with the tesseract executable?

Zdenko
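
To illustrate the guess above: in Unicode the "fl" ligature is a single
character, U+FB02. The following is a minimal sketch, not taken from the
thread, of one way such a character can end up as '?' in Java: encoding a
string with a charset that cannot represent it. The class name, the method
name, and the hard-coded sample string (standing in for OCR output) are
assumptions made for this illustration only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustration only: names and sample text are hypothetical.
final class LigatureEncodingDemo {

    // Encodes the text with the given charset and decodes it again,
    // mimicking text that is written out and read back with that charset.
    static String roundTrip(String text, Charset charset) {
        // String.getBytes(Charset) substitutes the charset's replacement
        // byte for every character it cannot map; for US-ASCII that is '?'.
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // Stand-in for OCR output: U+FB02 (fl ligature) followed by "uidi".
        String ocrResult = "\uFB02uidi";

        // Booleans are printed so the console encoding cannot distort the result.
        System.out.println(roundTrip(ocrResult, StandardCharsets.US_ASCII).equals("?uidi"));
        System.out.println(roundTrip(ocrResult, StandardCharsets.UTF_8).equals(ocrResult));
    }
}
```

If the same substitution happens anywhere between the OCR call and the place
where the text is displayed or saved, the ligature is lost even though the
recognition itself was correct.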

On Thu, Oct 16, 2014 at 10:18 AM, Salvo Piazza s.pi...@tsc-consulting.com
wrote:

Hi all,
I've written a simple little program to extract text from an image with
tesseract 3.0.2, as follows:

Tesseract instance = Tesseract.getInstance();
instance.setDatapath(currentDir);
instance.setLanguage("ita");
String returner = instance.doOCR(new File(filename));

It works fine, but I get many question mark characters ('?') in the extracted
text.

For example, the word *fluidi* is recognized as *?uidi*, and there are many
more examples...

Does anyone have any tips for fixing this behaviour?

Thanks in advance,
Salvo.

--
You received this message because you are subscribed to the Google Groups
tesseract-ocr group.
To unsubscribe from this group and stop receiving emails from it, send an
email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit
https://groups.google.com/d/msgid/tesseract-ocr/dc3dc154-fc24-48d8-8f5e-4a1df7f36282%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: [tesseract-ocr] Many 'question mark' chars in recognized text

2014-10-17 Thread Rick Leir

On Linux, try YAGF; it is a GUI front end for Tesseract. As zdenop said,
you have a Unicode problem. You need to use UTF-8 for strings.
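
As a hedged sketch of what "use UTF-8" can look like on the Java side: the
example below writes a string to a file and prints it to the console with an
explicit UTF-8 encoding instead of relying on the platform default charset.
The class name, the method name, and the sample string (a stand-in for the
value returned by doOCR) are assumptions for illustration and do not come
from the thread.

```java
import java.io.IOException;
import java.io.PrintStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustration only: names and sample text are hypothetical.
final class Utf8OutputDemo {

    // Writes the text to a temporary file as UTF-8, reads it back as UTF-8,
    // deletes the file, and returns what was read.
    static String roundTripThroughFile(String text) {
        try {
            Path file = Files.createTempFile("ocr-result", ".txt");
            Files.write(file, text.getBytes(StandardCharsets.UTF_8));
            String readBack = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            Files.delete(file);
            return readBack;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        String ocrResult = "\uFB02uidi"; // stand-in for the OCR result

        // The ligature survives the file round trip when UTF-8 is explicit.
        System.out.println(roundTripThroughFile(ocrResult).equals(ocrResult));

        // For console output, wrap System.out so it encodes as UTF-8.
        // The terminal itself must also be set to UTF-8 to display it.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println(ocrResult);
    }
}
```

The point of the design is that every place where characters become bytes
names its charset explicitly, so the result does not depend on the locale
the program happens to run under.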


Re: [tesseract-ocr] Many 'question mark' chars in recognized text

2014-10-17 Thread zdenko podobny

OCR a test image with your app and store the result in a text file. Then OCR
the same image with the tesseract executable (its output goes to a text file
by default) and compare the results.
If the output from the tesseract executable is OK but the output from your app
is wrong (e.g. it contains only ASCII letters), then the problem is within your
app (e.g. it does not handle Unicode strings correctly).

Zdenko
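
The comparison described above can be sketched as a small Java helper. This
is only an illustration under stated assumptions: the two input files are
hypothetical (for example app-result.txt saved by the application, and
cli-result.txt produced by running something like
"tesseract image.png cli-result -l ita"), both are assumed to be UTF-8, and
the class and method names are invented for this example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Illustration only: names and file layout are hypothetical.
final class OcrOutputCompare {

    // Returns the index of the first differing char, or -1 if both are equal.
    static int firstDifference(String a, String b) {
        int common = Math.min(a.length(), b.length());
        for (int i = 0; i < common; i++) {
            if (a.charAt(i) != b.charAt(i)) {
                return i;
            }
        }
        return a.length() == b.length() ? -1 : common;
    }

    // Lists every non-ASCII code point in the text in "U+XXXX" form.
    static List<String> nonAscii(String text) {
        List<String> found = new ArrayList<>();
        text.codePoints()
            .filter(cp -> cp > 0x7F)
            .forEach(cp -> found.add(String.format("U+%04X", cp)));
        return found;
    }

    // Usage: java OcrOutputCompare app-result.txt cli-result.txt
    public static void main(String[] args) throws IOException {
        String fromApp = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        String fromCli = new String(Files.readAllBytes(Paths.get(args[1])), StandardCharsets.UTF_8);

        System.out.println("First difference at index: " + firstDifference(fromApp, fromCli));
        System.out.println("Non-ASCII in app output: " + nonAscii(fromApp));
        System.out.println("Non-ASCII in CLI output: " + nonAscii(fromCli));
    }
}
```

If the command-line output lists code points such as U+FB02 while the app
output lists none, that matches the "only ASCII letters" symptom and points
at the application's string handling rather than at the recognition.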


[tesseract-ocr] Many 'question mark' chars in recognized text

2014-10-16 Thread Salvo Piazza

Hi all,
I've written a simple little program to extract text from an image with
tesseract 3.0.2, as follows:

Tesseract instance = Tesseract.getInstance();
instance.setDatapath(currentDir);
instance.setLanguage("ita");
String returner = instance.doOCR(new File(filename));

It works fine, but I get many question mark characters ('?') in the extracted
text.

For example, the word *fluidi* is recognized as *?uidi*, and there are many
more examples...

Does anyone have any tips for fixing this behaviour?

Thanks in advance,
Salvo.


Re: [tesseract-ocr] Many 'question mark' chars in recognized text

2014-10-16 Thread Greg Dunkel

Many OCR programs have trouble with ligatures.
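
One possible workaround on the application side, offered as a sketch and
not something proposed in the thread, is to expand ligature characters into
their plain letters once the text is available as a correctly decoded
string. The example uses java.text.Normalizer with the NFKC form, which maps
U+FB02 to "fl"; the class and method names are assumptions for illustration.
NFKC also rewrites other compatibility characters (for example superscript
digits), so it may change more than just ligatures.

```java
import java.text.Normalizer;

// Illustration only: names are hypothetical.
final class LigatureExpander {

    // Applies Unicode NFKC normalization, which replaces compatibility
    // characters such as the fi (U+FB01) and fl (U+FB02) ligatures with
    // their ordinary letter sequences.
    static String expandLigatures(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String ocrResult = "\uFB02uidi"; // stand-in for OCR output with a ligature
        System.out.println(expandLigatures(ocrResult)); // prints "fluidi"
    }
}
```

This only helps while the ligature is still present in the string; once it
has already been replaced by '?', the original letters cannot be recovered.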
