[tesseract-ocr] Re: Improve recognize russian chars

2014-09-19 Thread bulkinvk
Did not help... 

-- 
You received this message because you are subscribed to the Google Groups 
tesseract-ocr group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/8b2af9a4-c473-4594-9711-5dfdab9ace0e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


[tesseract-ocr] Re: Improve recognize russian chars

2014-09-19 Thread bulkinvk
Should I enlarge the picture (x3)?
Or increase the dpi on the scanner?



Re: [tesseract-ocr] Re: Improve recognize russian chars

2014-09-19 Thread Shree Devi Kumar
Increase the scanner resolution to at least 300 dpi, and pre-process the image.

See the tips given at
https://code.google.com/p/tesseract-ocr/wiki/ImproveQuality

As a test, I saved a screenshot of a Russian Wikipedia page.
Attached are the image and its output, and also the output from a blurred
version of the same image.
The output from your image is also attached.

I am using a compiled version of the latest source from git on Windows 8
under MSYS2.
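
As a rough illustration of the 300 dpi advice, the sketch below computes how much an image captured at a lower resolution would have to be enlarged to reach an equivalent pixel density. The function names `upscale_factor` and `scaled_size` are made up for this example and are not part of Tesseract. Rescanning at a higher dpi is generally preferable to enlarging in software, since interpolation adds no real detail.

```python
def upscale_factor(current_dpi, target_dpi=300):
    """Return the enlargement factor needed to reach target_dpi.

    Returns 1.0 when the image is already at or above the target.
    """
    if current_dpi <= 0:
        raise ValueError("current_dpi must be positive")
    return max(1.0, target_dpi / current_dpi)


def scaled_size(width, height, current_dpi, target_dpi=300):
    """Pixel dimensions of the image after enlarging to the target resolution."""
    factor = upscale_factor(current_dpi, target_dpi)
    return round(width * factor), round(height * factor)


if __name__ == "__main__":
    # A 96 dpi screenshot needs roughly a 3x enlargement.
    print(upscale_factor(96))
    print(scaled_size(800, 600, 96))
```

For a typical 96 dpi screenshot the factor comes out just above 3, which is in line with the "x3" enlargement asked about earlier in the thread.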


Shree Devi Kumar

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Fri, Sep 19, 2014 at 9:27 AM, bulki...@gmail.com wrote:

 Did not help...



ОТДЕЛЕНИЕ уж Ш

«Аполлон-17» (англ Ара/Ю т —11—и и паспедний
пилотируемый полет в рамках программы «Аполлон», в ходе
которого была осуществлена шестая высадка пюдей на Луну
Это была третья джеимиссия (англ „Ниш/дн) с акцентом на
научные исследования в экипаж корабля впервые вошел
ученый—профессионал, геопагХаррисон Шмитг в
распоряжении астронавтов так же, как и в ходе двух
предшествовавших зкспедици‘, бып лунный автомобилы «Лун
Команднослужебный модуль «Апаппона—П» имел позывные «
мадупь — иЧеппенджер»

«Апатит-17» (нит Ара/ю т— 1 и и последний
пилотируемый полет в рамках программы «Аполлон», в ходе
которого была осуществлена шестая высадка людей на Луну
Это была третья Мй—мисоля (амтл „Ат/шёл} с акцентом на
научные исследования В экипаж шрабпя шервыв вошёл
учёный—профессионал, теолог Харрисон Шмитт в
распоряжалии астронавтов так же, как и в ходе двух
лредлюавшаалмх экспедиций, был пуииый автомобипь‘ «Луи
Комаьшощпужебиый модуль «Апошкжа—П» имал позывные «
модуль — «Чеппеишкерх



[tesseract-ocr] Re: Need help reg pre-processing of image before ocr

2014-09-19 Thread Shree Devi Kumar
Do you still need a copy of the Sanskrit traineddata?

Shree Devi Kumar

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Fri, Aug 23, 2013 at 10:21 PM, mns_rao mns...@gmail.com wrote:

 Hi,
 The OCR output also depends on the traineddata file for the language of
 the input image. If you have a good traineddata file for Sanskrit, you
 can use FreeOCR 4.2 (http://www.paperfile.net/): choose "open language
 folder" in the settings and paste the file there. FreeOCR 4.2 can process an
 entire PDF book (loaded with 'open PDF') in one click using "ocr all pages".
 Try the original book first and, if the result is not satisfactory, convert
 the cleaned images into a PDF book again.
 I also need a Sanskrit traineddata file, if you can spare it.
 Wishing success,
 MNS Rao


 On Friday, 23 August 2013 18:38:44 UTC+5:30, shree wrote:

 I want to OCR a Sanskrit book available as a PDF.

 I used GSview to save all pages as PNG files and
 then used Scan Tailor to deskew the images, which saved them as TIFFs.
 Then I used IrfanView to apply blur and median filters, as the text is
 very grainy in the original, and also resized the pages to a smaller size.

 The image pre-processed as above gives a better result than the original.

 I would like to know if there is a simpler or better method to pre-process
 the images. The PDF has 500+ pages.

 I am attaching a single page from the pdf and the processed image file.
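
To illustrate what the median filter step does to grainy text, here is a minimal pure-Python sketch of a 3x3 median filter on a grayscale image stored as a list of rows. This is not what IrfanView runs internally, and the function name `median_filter_3x3` is made up for this example. For a 500+ page book a scripted imaging library would be the practical choice; the sketch only shows the technique.

```python
from statistics import median


def median_filter_3x3(image):
    """Apply a 3x3 median filter to a grayscale image.

    `image` is a list of rows of integer pixel values. Border pixels use
    only the neighbours that exist, so the output has the same size.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Gather the pixel and its available neighbours.
            window = [
                image[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            row.append(int(median(window)))
        result.append(row)
    return result


if __name__ == "__main__":
    # A single dark speck on a white background is removed.
    speckled = [
        [255, 255, 255],
        [255, 0, 255],
        [255, 255, 255],
    ]
    for row in median_filter_3x3(speckled):
        print(row)
```

Isolated specks disappear because a single outlier never becomes the median of its neighbourhood, while larger strokes such as character shapes mostly survive.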

 Thanks,
 Shree





[tesseract-ocr] Re: How to get paragraph wise text in Tesseract ?

2014-09-19 Thread Satya Swaroop
Yes, but it did not solve my issue.

On Thursday, September 18, 2014 10:21:39 PM UTC+5:30, Albrecht Hilker wrote:

 Did you try

 SetPageSegMode(PSM_AUTO)
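
If the page segmentation mode alone does not help, one possible workaround is to request hOCR output and read the paragraphs from it, since hOCR marks each paragraph as an element with class `ocr_par`. The sketch below is an assumption-laden illustration: the class `ParagraphExtractor`, the function `paragraphs_from_hocr` and the `SAMPLE` markup are hand-written for this example and are not real Tesseract output.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text of each <p class='ocr_par'> element in hOCR markup."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_par = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "p" and "ocr_par" in classes:
            self._in_par = True
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_par:
            # Collapse runs of whitespace left over from the markup layout.
            text = " ".join("".join(self._chunks).split())
            if text:
                self.paragraphs.append(text)
            self._in_par = False

    def handle_data(self, data):
        if self._in_par:
            self._chunks.append(data)


def paragraphs_from_hocr(hocr):
    """Return a list with one string per paragraph found in the hOCR text."""
    parser = ParagraphExtractor()
    parser.feed(hocr)
    parser.close()
    return parser.paragraphs


# Hand-written sample in the general shape of hOCR, for illustration only.
SAMPLE = """
<div class='ocr_page'>
 <p class='ocr_par'>
  <span class='ocr_line'><span class='ocrx_word'>First</span> <span class='ocrx_word'>paragraph</span></span>
  <span class='ocr_line'><span class='ocrx_word'>continues.</span></span>
 </p>
 <p class='ocr_par'>
  <span class='ocr_line'><span class='ocrx_word'>Second</span> <span class='ocrx_word'>one.</span></span>
 </p>
</div>
"""

if __name__ == "__main__":
    for number, text in enumerate(paragraphs_from_hocr(SAMPLE), start=1):
        print(number, text)
```

Whether the paragraphs come out as expected still depends on the layout analysis, so this only changes how the result is read, not how the page is segmented.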




[tesseract-ocr] Re: Tesseract recognizes the characters irrespective of the lines

2014-09-19 Thread Satya Swaroop
I am also facing the same problem. Please post your answer once you find it.

Thanks in advance


On Tuesday, September 9, 2014 6:58:15 PM UTC+5:30, Dineshkumar wrote:

 What steps will reproduce the problem?
 1. Run the Tesseract OCR in Java for the attached image 
 2. Save the OCR result in a text file
 3. Check the order of the output text file with the attached image.
 What is the expected output? What do you see instead?
 Expected output -- The words in horizontal left-to-right order, line by
 line.

 Actual output   -- The words appear in random order, irrespective of line order.
 What version of the product are you using? On what operating system?
 Tesseract 3.01 and Windows 7 
 Please provide any additional information below.
 The input and the expected and actual output are attached for reference.
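
As a possible post-processing workaround for the ordering problem reported above, the following sketch re-sorts recognized words into reading order. It assumes word bounding boxes are available, for example from hOCR output. The function name `reading_order`, the `(text, x, y)` tuple layout and the `line_tolerance` value are hypothetical choices for this example. It does not fix the underlying layout analysis.

```python
def reading_order(words, line_tolerance=10):
    """Group (text, x, y) word tuples into lines and sort them for reading.

    Words are ordered top-to-bottom by y. A word joins the current line when
    its y lies within line_tolerance pixels of that line's first word.
    Within each line, words are ordered left-to-right by x.
    """
    lines = []
    for word in sorted(words, key=lambda w: w[2]):
        if lines and abs(word[2] - lines[-1][0][2]) <= line_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return [[w[0] for w in sorted(line, key=lambda w: w[1])] for line in lines]


if __name__ == "__main__":
    # Made-up coordinates: two lines of two words each, given out of order.
    shuffled = [
        ("world", 120, 12),
        ("Hello", 10, 10),
        ("line", 90, 52),
        ("Second", 10, 50),
    ]
    for line in reading_order(shuffled):
        print(" ".join(line))
```

The tolerance has to be tuned to the line height of the scanned page; a value that is too large merges adjacent lines, and one that is too small splits a slightly skewed line.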





Re: [tesseract-ocr] Modification of background image allowed in PDF output?

2014-09-19 Thread zdenko podobny
This is a known issue. Try the current code from the git repository; it
should be fixed there.

Zdenko

On Fri, Sep 19, 2014 at 2:38 PM, Frank Siegert frank.sieg...@googlemail.com
 wrote:

 Dear all,

 I have been testing tesseract to embed OCR in scanned PDF documents, and
 it works phenomenally well in recognizing the text.

 Now I noticed one slightly disturbing issue just by chance when comparing
 the original input image and the PDF file: A number of straight lines that
 are present in the input image have disappeared completely in the PDF (some
 of them are horizontal rules, others are lines in a logo). Since I wanted to
 use tesseract to produce completely unmodified documents with only the OCR
 text layer added, this would be a problem for me. I have uploaded a test
 image for this to http://cern.ch/fsiegert/tmp/tesseract-test.tif and here
 is the command I used on it:

 $ tesseract -l deu tesseract-test.tif tesseract-test pdf
 Tesseract Open Source OCR Engine v3.03 with Leptonica
 OSD: Weak margin (6.96) for 162 blob text block, but using orientation
 anyway: 1
 $ tesseract --version
 tesseract 3.03
  leptonica-1.71
   libgif 5.1.0 : libjpeg 8d : libpng 1.6.12 : libtiff 4.0.3 : zlib 1.2.8
 : libwebp 0.4.1


 This results in http://cern.ch/fsiegert/tmp/tesseract-test.pdf, which is
 missing the straight horizontal lines and the ones in the logo. Is this
 line-removal done on purpose and can it be disabled?

 Cheers,
 Frank

 PS: I have removed much more text from the document for privacy reasons,
 but the same happens when the document is complete with text.





Re: [tesseract-ocr] Modification of background image allowed in PDF output?

2014-09-19 Thread Frank Siegert
Dear Zdenko,

Thanks for the quick reply!

Does that mean that in general, i.e. apart from this bug, I can assume by 
construction that the image remains unmodified and only a text layer is 
added?

Cheers,
Frank








Re: [tesseract-ocr] Modification of background image allowed in PDF output?

2014-09-19 Thread zdenko podobny
Well, yes and no ;-)
Yes: there should be no change to the image. No: you need to expect
that the PDF renderer may (re)compress the input image. See the
comments on issue 1285 [1] for more details.

[1] https://code.google.com/p/tesseract-ocr/issues/detail?id=1285

Zdenko

On Fri, Sep 19, 2014 at 3:14 PM, Frank Siegert frank.sieg...@googlemail.com
 wrote:

 Dear Zdenko,

 Thanks for the quick reply!

 Does that mean that in general, i.e. apart from this bug, I can assume by
 construction that the image remains unmodified and only a text layer is
 added?

 Cheers,
 Frank






[tesseract-ocr] version 3.04

2014-09-19 Thread Rick Leir
Ubuntu 14.04 has tess 3.03 and lept 1.70.

I compiled tess 3.04 and lept 1.71, and installed them 
   (and ran ldconfig so the new libraries would get used).

Is it ok to use the old tessdata from 3.03 that was installed from the 
Ubuntu package?  I start tess with
$ TESSDATA_PREFIX=/usr/share/tesseract-ocr/ tesseract ofilename ifilename 
quiet hocr

It seems to work fine, but maybe the training data needs to be freshened.
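
For reference, Tesseract 3.x treats `TESSDATA_PREFIX` as the parent of the `tessdata` directory, so with the command above the language files are looked up under `/usr/share/tesseract-ocr/tessdata/`. A small sketch along these lines can check which language files are present there. The helper names `traineddata_path` and `missing_languages` are made up for this example, and the check says nothing about whether older data is compatible with a newer build.

```python
import os


def traineddata_path(prefix, lang="eng"):
    """Path where a Tesseract 3.x build is expected to look for a language file.

    Assumes TESSDATA_PREFIX is the parent of the tessdata directory.
    """
    return os.path.join(prefix, "tessdata", lang + ".traineddata")


def missing_languages(prefix, langs):
    """Return the languages whose traineddata file is not found under prefix."""
    return [
        lang for lang in langs
        if not os.path.isfile(traineddata_path(prefix, lang))
    ]


if __name__ == "__main__":
    prefix = os.environ.get("TESSDATA_PREFIX", "/usr/share/tesseract-ocr/")
    print(traineddata_path(prefix, "eng"))
    print(missing_languages(prefix, ["eng", "deu", "rus"]))
```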

Thanks
Rick



Re: [tesseract-ocr] version 3.04

2014-09-19 Thread zdenko podobny
There is no Tesseract 3.04, so you cannot install it.
Your question indicates that you do not understand the consequences of your
action, so I strongly suggest that you revert to the last stable release,
which is 3.02.02.


Zdenko

On Fri, Sep 19, 2014 at 8:31 PM, Rick Leir rich...@c7a.ca wrote:

 Ubuntu 14.04 has tess 3.03 and lept 1.70.

 I compiled tess 3.04 and lept 1.71, and installed them
(and ran ldconfig so the new libraries would get used).

 Is it ok to use the old tessdata from 3.03 that was installed from the
 Ubuntu package?  I start tess with
 $ TESSDATA_PREFIX=/usr/share/tesseract-ocr/ tesseract ofilename ifilename
 quiet hocr

 It seems to work fine, but maybe the training data needs to be freshened.

 Thanks
 Rick


