Re: [tesseract-ocr] Many 'question mark' chars in recognized text

2014-10-17 Thread Salvo Piazza
Hi Zdenko,
thanks for your response.

I'm only a beginner with tesseract, so can you tell me how I can 
check it? (I use a Linux version of tesseract.)

Thanks,
Salvo.


On Thursday, 16 October 2014 21:46:31 UTC+2, zdenop wrote:

 fl is recognized as a ligature in English, so there could be the same issue 
 in Italian. If it is replaced with '?', I would guess you have a problem with 
 Unicode. Can you check it with the tesseract executable?

 Zdenko

-- 
You received this message because you are subscribed to the Google Groups 
tesseract-ocr group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/1ab08bc3-b6aa-45a7-bdfd-a6c0402e950f%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: [tesseract-ocr] Many 'question mark' chars in recognized text

2014-10-17 Thread Rick Leir
On Linux, try YAGF; it is a GUI front end for Tesseract. As zdenop said, 
you have a Unicode problem. You need to use UTF-8 for strings.
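
To illustrate the point about UTF-8, here is a minimal Java sketch (Java because the original program is in Java). It assumes the '?' comes from encoding the OCR result with a charset that cannot represent the fl ligature (U+FB02); that cause is a guess and is not confirmed anywhere in this thread. The class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingDemo {

    // Encode the text with the given charset and decode it again,
    // which is what effectively happens when text is written and read back.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // "fluidi" as an OCR engine may return it: U+FB02 (fl ligature) + "uidi"
        String ocrText = "\uFB02uidi";

        // US-ASCII cannot represent U+FB02, so getBytes substitutes '?'
        System.out.println(roundTrip(ocrText, StandardCharsets.US_ASCII)); // prints "?uidi"

        // UTF-8 can represent every Unicode character, so nothing is lost
        System.out.println(roundTrip(ocrText, StandardCharsets.UTF_8).equals(ocrText)); // prints "true"
    }
}
```

When saving the result, passing the charset explicitly, for example `Files.write(path, text.getBytes(StandardCharsets.UTF_8))`, avoids depending on the platform default charset.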


Re: [tesseract-ocr] Many 'question mark' chars in recognized text

2014-10-17 Thread zdenko podobny
OCR a test image with your app and store the result in a text file. Then OCR
the same image with the tesseract executable (output goes to a text file by
default) and compare the results.
If the output from the tesseract executable is OK but the output from your app
is wrong (e.g. there are only ASCII letters), you have a problem within your
app (e.g. it does not handle Unicode strings correctly).

Zdenko
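
The comparison described above can be sketched in Java as follows. This assumes both results were saved as UTF-8 text files; the class and method names, and the file names in the usage note, are hypothetical. It lists the non-ASCII code points in each file, since a result containing only ASCII letters is the symptom mentioned above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;

final class OcrOutputCompare {

    // Read a whole file, decoding it explicitly as UTF-8.
    static String readUtf8(Path path) throws IOException {
        return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
    }

    // List every non-ASCII code point in the text, e.g. "U+FB02".
    static List<String> nonAscii(String text) {
        return text.codePoints()
                .filter(cp -> cp > 0x7F)
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java OcrOutputCompare <app-output.txt> <tesseract-output.txt>");
            return;
        }
        String fromApp = readUtf8(Paths.get(args[0]));
        String fromCli = readUtf8(Paths.get(args[1]));

        System.out.println("identical: " + fromApp.equals(fromCli));
        System.out.println("non-ASCII in app output:       " + nonAscii(fromApp));
        System.out.println("non-ASCII in tesseract output: " + nonAscii(fromCli));
    }
}
```

The reference file could be produced with something like `tesseract test.png cli -l ita`, which writes `cli.txt`; `test.png` and `cli` are placeholder names. If the tesseract output lists code points such as U+FB02 and the app output lists none, the characters are most likely being lost inside the app.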



[tesseract-ocr] Many 'question mark' chars in recognized text

2014-10-16 Thread Salvo Piazza
Hi all,
I've written a small program to extract text from an image with 
tesseract 3.0.2:

Tesseract instance = Tesseract.getInstance();
instance.setDatapath(currentDir);
instance.setLanguage("ita"); // language code passed as a String
String returner = instance.doOCR(new File(filename));

It works fine, but I get many question mark characters ('?') in the extracted text.

For example, the word *fluidi* is recognized as *?uidi*, and there are many 
more examples.

Does anyone have any tips for fixing this behaviour?

Thanks in advance,
Salvo.


Re: [tesseract-ocr] Many 'question mark' chars in recognized text

2014-10-16 Thread Greg Dunkel
Many OCR programs have trouble with ligatures.
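
If the ligature characters themselves are unwanted in the output, one option (not something suggested in this thread) is to expand them after recognition with Unicode compatibility normalization. Below is a minimal sketch using the standard `java.text.Normalizer`; the class and method names are illustrative. Note that NFKC also rewrites other compatibility characters, such as superscripts and full-width forms, so it may change more than ligatures.

```java
import java.text.Normalizer;

final class LigatureExpander {

    // NFKC replaces compatibility characters such as U+FB02 with their
    // plain equivalents ("fl") and keeps accented letters composed.
    static String expandLigatures(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String ocrText = "\uFB02uidi"; // U+FB02 (fl ligature) + "uidi"
        System.out.println(expandLigatures(ocrText)); // prints "fluidi"
    }
}
```

This only helps if the ligature actually reaches the Java string; if it has already been replaced by '?', the encoding problem has to be fixed first.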
