[tesseract-ocr] Re: Improve recognize russian chars
Did not help... -- You received this message because you are subscribed to the Google Groups tesseract-ocr group. To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com. To post to this group, send email to tesseract-ocr@googlegroups.com. Visit this group at http://groups.google.com/group/tesseract-ocr. To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/8b2af9a4-c473-4594-9711-5dfdab9ace0e%40googlegroups.com. For more options, visit https://groups.google.com/d/optout.
[tesseract-ocr] Re: Improve recognize russian chars
Should I enlarge the picture (3x)? Or increase the DPI on the scanner?
Re: [tesseract-ocr] Re: Improve recognize russian chars
Increase the scanner resolution to at least 300 dpi and pre-process the image. See the tips given at https://code.google.com/p/tesseract-ocr/wiki/ImproveQuality

As a test, I saved a screenshot of a Wikipedia page in Russian. Attached are the image and its output, along with the output from a blurred version of the same image. The output from your image is also attached.

I am using a build of the latest source from git on Windows 8 under MSYS2.

Shree Devi Kumar
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Fri, Sep 19, 2014 at 9:27 AM, bulki...@gmail.com wrote:
> Did not help...
ОТДЕЛЕНИЕ уж Ш «Аполлон-17» (англ Ара/Ю т —11—и и паспедний пилотируемый полет в рамках программы «Аполлон», в ходе которого была осуществлена шестая высадка пюдей на Луну Это была третья джеимиссия (англ „Ниш/дн) с акцентом на научные исследования в экипаж корабля впервые вошел ученый—профессионал, геопагХаррисон Шмитг в распоряжении астронавтов так же, как и в ходе двух предшествовавших зкспедици‘, бып лунный автомобилы «Лун Команднослужебный модуль «Апаппона—П» имел позывные « мадупь — иЧеппенджер» «Апатит-17» (нит Ара/ю т— 1 и и последний пилотируемый полет в рамках программы «Аполлон», в ходе которого была осуществлена шестая высадка людей на Луну Это была третья Мй—мисоля (амтл „Ат/шёл} с акцентом на научные исследования В экипаж шрабпя шервыв вошёл учёный—профессионал, теолог Харрисон Шмитт в распоряжалии астронавтов так же, как и в ходе двух лредлюавшаалмх экспедиций, был пуииый автомобипь‘ «Луи Комаьшощпужебиый модуль «Апошкжа—П» имал позывные « модуль — «Чеппеишкерх
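The 300 dpi advice and the earlier "x3" question can be related with a little arithmetic. The sketch below is only illustrative: the helper names `upscale_factor` and `scaled_size` are made up for this note, and the 96 dpi figure for a screenshot is an assumption about a typical screen, not something stated in the thread. Resampling cannot add detail that the original capture lacks, so rescanning at a higher resolution is usually preferable when the paper original is available.

```python
def upscale_factor(current_dpi, target_dpi=300):
    """Return the factor by which an image must be enlarged so that its
    effective resolution reaches target_dpi. Never returns less than 1.0,
    because shrinking an image that is already sharp enough is not the goal."""
    if current_dpi <= 0:
        raise ValueError("current_dpi must be positive")
    return max(1.0, target_dpi / current_dpi)


def scaled_size(width_px, height_px, current_dpi, target_dpi=300):
    """Return the (width, height) in pixels after enlarging to target_dpi."""
    factor = upscale_factor(current_dpi, target_dpi)
    return round(width_px * factor), round(height_px * factor)


if __name__ == "__main__":
    # Hypothetical 800x600 screenshot, assumed to be 96 dpi.
    print(upscale_factor(96))         # 3.125, close to the "x3" asked about
    print(scaled_size(800, 600, 96))  # (2500, 1875)
```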
[tesseract-ocr] Re: Need help reg pre-processing of image before ocr
Do you still need a copy of the Sanskrit traineddata?

Shree Devi Kumar
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Fri, Aug 23, 2013 at 10:21 PM, mns_rao mns...@gmail.com wrote:
> Hi,
>
> The OCR output also depends on the traineddata file for the language of the input image. If you have a good traineddata file for Sanskrit, you can use FreeOCR 4.2 (http://www.paperfile.net/): in the settings, open the language folder and paste the file there. FreeOCR 4.2 can OCR an entire PDF book in one click (load it with 'open PDF', then OCR all pages). Try the original book first, and if the result is not satisfactory, convert the cleaned images into a PDF book again.
>
> I also need a Sanskrit traineddata file, if you can spare it.
>
> Wishing success,
> MNS Rao
>
> On Friday, 23 August 2013 18:38:44 UTC+5:30, shree wrote:
>> I want to OCR a Sanskrit book available as a PDF. I used gsview to save all pages as PNG and then used scantailor to deskew the images, which saved them as TIFs. Then I used irfanview to apply blur and median filters, as the text is very grainy in the original, and also resized the page to a smaller size.
>>
>> The image pre-processed as above gives a better result than the original. I would like to know if there is a simpler or better method to pre-process the image. The PDF is 500+ pages.
>>
>> I am attaching a single page from the PDF and the processed image file.
>>
>> Thanks,
>> Shree
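To show why the median filter mentioned in the quoted message helps with grainy scans, here is a minimal pure-Python sketch. It is not the irfanview implementation and is far too slow for 500+ real pages; the function name `median_filter_3x3` and the list-of-rows image representation are assumptions chosen only to keep the example self-contained. The idea is that an isolated speck is outvoted by its neighbours, while larger structures such as character strokes mostly survive.

```python
from statistics import median


def median_filter_3x3(pixels):
    """Return a filtered copy of a grayscale image given as a list of rows
    of integer intensities (0 = black, 255 = white).

    Each pixel is replaced by the median of its 3x3 neighbourhood.
    Border pixels use only the neighbours that actually exist."""
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                pixels[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            # median() may return a float for even-sized border windows
            row.append(int(median(window)))
        result.append(row)
    return result


if __name__ == "__main__":
    # A white 5x5 page with a single black speck of "grain" in the middle.
    page = [[255] * 5 for _ in range(5)]
    page[2][2] = 0
    for row in median_filter_3x3(page):
        print(row)
```

On the tiny test page above, the single dark pixel disappears because every 3x3 window that contains it is dominated by white pixels.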
[tesseract-ocr] Re: How to get paragraph wise text in Tesseract ?
Yes, but it did not solve my issue.

On Thursday, September 18, 2014 10:21:39 PM UTC+5:30, Albrecht Hilker wrote:
> Did you try SetPageSegMode(PSM_AUTO)?
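A possible alternative that was not proposed in this thread: when tesseract is run with the hocr config, the output normally marks each detected paragraph with a `<p class='ocr_par'>` element, so paragraph-wise text can be collected from that file. The sketch below assumes that markup is present and uses only the Python standard library; the class name `ParagraphCollector`, the helper `paragraphs_from_hocr`, and the sample markup are made up for illustration. Whether the detected paragraphs match the visual paragraphs still depends on the page layout analysis.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of every <p class='ocr_par'> element in hOCR input."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._inside = False
        self._words = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "p" and "ocr_par" in classes:
            self._inside = True
            self._words = []

    def handle_endtag(self, tag):
        if tag == "p" and self._inside:
            self.paragraphs.append(" ".join(self._words))
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self._words.extend(data.split())


def paragraphs_from_hocr(hocr_text):
    """Return a list with one string per ocr_par paragraph."""
    collector = ParagraphCollector()
    collector.feed(hocr_text)
    collector.close()
    return collector.paragraphs


if __name__ == "__main__":
    # Hand-written sample in the shape of hOCR output, not real tesseract output.
    sample = (
        "<div class='ocr_carea'>"
        "<p class='ocr_par'><span class='ocr_line'>"
        "<span class='ocrx_word'>Hello</span> <span class='ocrx_word'>world</span>"
        "</span></p>"
        "<p class='ocr_par'><span class='ocr_line'>"
        "<span class='ocrx_word'>Second</span>"
        "</span></p></div>"
    )
    for paragraph in paragraphs_from_hocr(sample):
        print(paragraph)
```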
[tesseract-ocr] Re: Tesseract recognizes the characters irrespective of the lines
I am also facing the same problem. Please post your answer once you find it. Thanks in advance.

On Tuesday, September 9, 2014 6:58:15 PM UTC+5:30, Dineshkumar wrote:
> What steps will reproduce the problem?
> 1. Run Tesseract OCR in Java on the attached image.
> 2. Save the OCR result in a text file.
> 3. Compare the order of the text in the output file with the attached image.
>
> What is the expected output? What do you see instead?
> Expected output: the words in horizontal, left-to-right line order.
> Actual output: the words appear in random order, irrespective of the line order.
>
> What version of the product are you using? On what operating system?
> Tesseract 3.01 on Windows 7.
>
> Please provide any additional information below.
> The input and the expected and actual output are attached for reference.
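No fix was posted in this part of the thread. As a possible post-processing workaround, if word positions are available (for example from hOCR or box output), the words can be regrouped into lines by their vertical position and then sorted left to right. The sketch below is an assumption-laden illustration, not a tesseract feature: the `(text, left, top)` tuple format, the function name `words_in_reading_order`, and the 10-pixel tolerance are all invented here, and the approach only suits simple single-column pages.

```python
def words_in_reading_order(words, line_tolerance=10):
    """Group words into text lines and return one string per line.

    words: iterable of (text, left, top) tuples in pixel coordinates.
    Two words belong to the same line when their top coordinates differ
    by at most line_tolerance from the first word of that line."""
    lines = []
    for word in sorted(words, key=lambda w: w[2]):  # top to bottom
        if lines and abs(word[2] - lines[-1][0][2]) <= line_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Within each line, order the words left to right.
    return [
        " ".join(text for text, _, _ in sorted(line, key=lambda w: w[1]))
        for line in lines
    ]


if __name__ == "__main__":
    # Hypothetical word positions, deliberately out of order.
    scrambled = [
        ("world", 120, 12),
        ("Hello", 10, 10),
        ("line", 90, 52),
        ("Second", 10, 50),
    ]
    for line in words_in_reading_order(scrambled):
        print(line)
```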
Re: [tesseract-ocr] Modification of background image allowed in PDF output?
This is a known issue. Try the current code from the git repository; it should be fixed there.

Zdenko

On Fri, Sep 19, 2014 at 2:38 PM, Frank Siegert frank.sieg...@googlemail.com wrote:
> Dear all,
>
> I have been testing tesseract to embed OCR in scanned PDF documents, and it works phenomenally well in recognizing the text.
>
> Now I noticed one slightly disturbing issue, just by chance, when comparing the original input image and the PDF file: a number of straight lines that are present in the input image have disappeared completely in the PDF (some of them are horizontal rules, others are lines in a logo). Since I wanted to use tesseract to produce completely unmodified documents with only the OCR text layer added, this would be a problem for me.
>
> I have uploaded a test image for this to http://cern.ch/fsiegert/tmp/tesseract-test.tif and here is the command I used on it:
>
> $ tesseract -l deu tesseract-test.tif tesseract-test pdf
> Tesseract Open Source OCR Engine v3.03 with Leptonica
> OSD: Weak margin (6.96) for 162 blob text block, but using orientation anyway: 1
>
> $ tesseract --version
> tesseract 3.03
> leptonica-1.71
> libgif 5.1.0 : libjpeg 8d : libpng 1.6.12 : libtiff 4.0.3 : zlib 1.2.8 : libwebp 0.4.1
>
> This results in http://cern.ch/fsiegert/tmp/tesseract-test.pdf, which is missing the straight horizontal lines and the ones in the logo.
>
> Is this line removal done on purpose, and can it be disabled?
>
> Cheers,
> Frank
>
> PS: I have removed much more text from the document for privacy reasons, but the same happens when the document is complete with text.
Re: [tesseract-ocr] Modification of background image allowed in PDF output?
Dear Zdenko,

Thanks for the quick reply! Does that mean that in general, i.e. except for this bug, I can assume by construction that the image will remain unmodified and only a text layer will be added?

Cheers,
Frank

On Friday, September 19, 2014 2:54:52 PM UTC+2, zdenop wrote:
> This is a known issue. Try the current code from the git repository; it should be fixed there.
>
> Zdenko
Re: [tesseract-ocr] Modification of background image allowed in PDF output?
Well, yes and no ;-)

Yes: there should be no change to the image. No: you need to expect that the PDF renderer may (re)compress the input image. See the comments on issue 1285 [1] for more details.

[1] https://code.google.com/p/tesseract-ocr/issues/detail?id=1285

Zdenko

On Fri, Sep 19, 2014 at 3:14 PM, Frank Siegert frank.sieg...@googlemail.com wrote:
> Dear Zdenko,
>
> Thanks for the quick reply! Does that mean that in general, i.e. except for this bug, I can assume by construction that the image will remain unmodified and only a text layer will be added?
>
> Cheers,
> Frank
[tesseract-ocr] version 3.04
Ubuntu 14.04 has tess 3.03 and lept 1.70. I compiled tess 3.04 and lept 1.71 and installed them (and ran ldconfig so the new libraries would get used). Is it OK to use the old tessdata from 3.03 that was installed from the Ubuntu package? I start tess with

$ TESSDATA_PREFIX=/usr/share/tesseract-ocr/ tesseract ofilename ifilename quiet hocr

It seems to work fine, but maybe the training data needs to be refreshed.

Thanks,
Rick
Re: [tesseract-ocr] version 3.04
There is no tesseract 3.04, so you cannot install it. Your question indicates that you do not understand the consequences of your actions, so I strongly suggest that you revert to the last stable release, which is 3.02.02.

Zdenko

On Fri, Sep 19, 2014 at 8:31 PM, Rick Leir rich...@c7a.ca wrote:
> Ubuntu 14.04 has tess 3.03 and lept 1.70. I compiled tess 3.04 and lept 1.71 and installed them (and ran ldconfig so the new libraries would get used). Is it OK to use the old tessdata from 3.03 that was installed from the Ubuntu package? I start tess with
>
> $ TESSDATA_PREFIX=/usr/share/tesseract-ocr/ tesseract ofilename ifilename quiet hocr
>
> It seems to work fine, but maybe the training data needs to be refreshed.
>
> Thanks,
> Rick