Re: [tesseract-ocr] Many 'question mark' chars in recognized text
Hi Zdenko,

thanks for your response. I know tesseract only at a very basic level, so can you tell me how I can check it? (I use a Linux version of tesseract...)

Thanks,
Salvo.

On Thursday, 16 October 2014 21:46:31 UTC+2, zdenop wrote:
> "fl" is recognized as a ligature in English, so there could be the same issue in Italian. If it is replaced with '?', I would guess you have a problem with Unicode... Can you check it with the tesseract executable?
>
> Zdenko

-- You received this message because you are subscribed to the Google Groups tesseract-ocr group. To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com. To post to this group, send email to tesseract-ocr@googlegroups.com. Visit this group at http://groups.google.com/group/tesseract-ocr. To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/1ab08bc3-b6aa-45a7-bdfd-a6c0402e950f%40googlegroups.com. For more options, visit https://groups.google.com/d/optout.
Re: [tesseract-ocr] Many 'question mark' chars in recognized text
On Linux, try YAGF; it is a GUI front end for Tesseract. As zdenop said, you have a Unicode problem. You need to use UTF-8 for strings.

On Friday, October 17, 2014 6:07:26 AM UTC-4, Salvo Piazza wrote:
> Hi Zdenko, thanks for your response. I know tesseract only at a very basic level, so can you tell me how I can check it? (I use a Linux version of tesseract...) Thanks, Salvo.
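To illustrate the UTF-8 advice above, here is a minimal Java sketch of how a '?' can appear. It assumes (this is not confirmed in the thread) that the OCR result contains the single character U+FB02 (the "fl" ligature) and that the application writes or prints it through a charset that cannot represent it. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class LigatureEncodingDemo {

    /**
     * Encodes the text with the given charset and decodes it again,
     * mimicking a write/read cycle through a file or console.
     * String.getBytes(Charset) replaces unmappable characters with the
     * charset's replacement byte, which is '?' for ISO-8859-1.
     */
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // Assumed OCR result: "fluidi" spelled with U+FB02 LATIN SMALL LIGATURE FL.
        String ocrText = "\uFB02uidi";

        String latin1 = roundTrip(ocrText, StandardCharsets.ISO_8859_1);
        String utf8 = roundTrip(ocrText, StandardCharsets.UTF_8);

        System.out.println("ISO-8859-1 keeps the text intact: " + latin1.equals(ocrText));
        System.out.println("UTF-8 keeps the text intact:      " + utf8.equals(ocrText));
        System.out.println("ISO-8859-1 result is \"?uidi\":     " + latin1.equals("?uidi"));
    }
}
```

If this is what happens in the application, writing the result with an explicit UTF-8 charset, instead of relying on the platform default, should preserve the ligature.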
Re: [tesseract-ocr] Many 'question mark' chars in recognized text
OCR a test image with your app and store the result in a text file. Then OCR the same image with the tesseract executable (its output goes to a text file by default) and compare the results. If the output from the tesseract executable is OK but the output from your app is wrong (e.g. it contains only ASCII letters), then the problem is within your app (e.g. it does not handle Unicode strings correctly).

Zdenko

On Fri, Oct 17, 2014 at 12:07 PM, Salvo Piazza s.pia...@tsc-consulting.com wrote:
> Hi Zdenko, thanks for your response. I know tesseract only at a very basic level, so can you tell me how I can check it? (I use a Linux version of tesseract...) Thanks, Salvo.
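The comparison described above could be sketched in Java roughly as follows. This is only an illustration under stated assumptions: the class, method, and file names are hypothetical, the executable's output is assumed to have been produced with something like `tesseract image.png cli_result -l ita` (which writes `cli_result.txt`), and both files are read as UTF-8.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class OcrOutputCompare {

    enum Verdict { MATCH, APP_LOST_NON_ASCII, DIFFERENT }

    /** Returns true if the text contains any character outside the ASCII range. */
    static boolean hasNonAscii(String text) {
        return text.chars().anyMatch(c -> c > 127);
    }

    /** Saves text as UTF-8, independent of the platform default encoding. */
    static void saveUtf8(Path file, String text) throws IOException {
        Files.write(file, text.getBytes(StandardCharsets.UTF_8));
    }

    /** Reads a whole file as UTF-8. */
    static String readUtf8(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    /** Compares the app's text with the executable's text, ignoring surrounding whitespace. */
    static Verdict compare(String appText, String cliText) {
        if (appText.trim().equals(cliText.trim())) {
            return Verdict.MATCH;
        }
        // The executable kept non-ASCII characters but the app has only ASCII:
        // this points at string or encoding handling inside the app.
        if (hasNonAscii(cliText) && !hasNonAscii(appText)) {
            return Verdict.APP_LOST_NON_ASCII;
        }
        return Verdict.DIFFERENT;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: OcrOutputCompare <app_result.txt> <cli_result.txt>");
            return;
        }
        String appText = readUtf8(Paths.get(args[0]));
        String cliText = readUtf8(Paths.get(args[1]));
        System.out.println(compare(appText, cliText));
    }
}
```

In the app, the string returned by `doOCR` would be passed to `saveUtf8` before running the comparison. A result of `APP_LOST_NON_ASCII` would correspond to the case described above, where the problem lies within the app rather than in tesseract.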
[tesseract-ocr] Many 'question mark' chars in recognized text
Hi all,

I've written a simple little program to extract text from an image with tesseract 3.0.2:

    Tesseract instance = Tesseract.getInstance();
    instance.setDatapath(currentDir);
    instance.setLanguage("ita");  // language code passed as a string
    String returner = instance.doOCR(new File(filename));

It works fine, but I get many question mark characters ('?') in the extracted text. For example, the word *fluidi* is recognized as *?uidi*, and there are many more examples...

Does anyone know any tips to fix this behaviour?

Thanks in advance,
Salvo.
Re: [tesseract-ocr] Many 'question mark' chars in recognized text
Many OCR programs have trouble with ligatures.
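If the ligature characters themselves are unwanted in the output, one possible workaround (not something proposed in this thread) is to expand them after recognition. The sketch below assumes the OCR text is held correctly as a Java `String` and uses Unicode NFKC normalization from the standard library; the class and method names are hypothetical. Note that NFKC also rewrites other compatibility characters (for example, superscript digits), so it may change more than ligatures.

```java
import java.text.Normalizer;

final class LigatureExpander {

    /**
     * Applies Unicode NFKC normalization, which replaces compatibility
     * characters such as U+FB02 (fl ligature) with their plain-letter
     * equivalents ("fl").
     */
    static String expandLigatures(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String ocrText = "\uFB02uidi"; // assumed OCR result containing the fl ligature
        System.out.println(expandLigatures(ocrText).equals("fluidi")); // prints "true"
    }
}
```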