date:20140919

[tesseract-ocr] Re: Improve recognize russian chars

2014-09-19 Thread bulkinvk

Did not help... 

-- 
You received this message because you are subscribed to the Google Groups 
tesseract-ocr group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/8b2af9a4-c473-4594-9711-5dfdab9ace0e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

[tesseract-ocr] Re: Improve recognize russian chars

2014-09-19 Thread bulkinvk

Should I enlarge the picture (x3)?
Or increase the DPI on the scanner?


Re: [tesseract-ocr] Re: Improve recognize russian chars

2014-09-19 Thread Shree Devi Kumar

Increase the scanner resolution to at least 300 dpi, and pre-process the image.

See the tips given at
https://code.google.com/p/tesseract-ocr/wiki/ImproveQuality
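
As a rough illustration of the 300 dpi advice, the sketch below computes how much an existing image would have to be enlarged to reach that resolution. The function name and the 96 dpi screenshot are assumptions made for this example, not something from the thread. Enlarging an image does not add detail, so rescanning at a higher resolution is usually preferable when it is possible.

```python
import math

TARGET_DPI = 300  # minimum resolution suggested in the message above


def upscale_size(width_px, height_px, source_dpi, target_dpi=TARGET_DPI):
    """Return (scale, new_width, new_height) needed to reach target_dpi.

    Images already at or above the target are left unchanged (scale 1.0).
    """
    if source_dpi <= 0:
        raise ValueError("source_dpi must be positive")
    scale = max(1.0, target_dpi / source_dpi)
    return scale, math.ceil(width_px * scale), math.ceil(height_px * scale)


# Hypothetical input: an 800x600 screenshot captured at 96 dpi.
scale, new_w, new_h = upscale_size(800, 600, source_dpi=96)
print(f"scale={scale:.3f}, new size={new_w}x{new_h}")
```

For this assumed input the factor comes out slightly above 3, which is in line with the "x3" enlargement asked about earlier in the thread.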

For a test, I saved a screenshot of a Russian Wikipedia page.
Attached are the image and its output, and also the output from a blurred
version of the same image.
The output from your image is also attached.

I am using a build of the latest source from git, compiled on Windows 8
under msys2.

Shree Devi Kumar

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Fri, Sep 19, 2014 at 9:27 AM, bulki...@gmail.com wrote:

Did not help...

ОТДЕЛЕНИЕ уж Ш

«Аполлон-17» (англ Ара/Ю т —11—и и паспедний
пилотируемый полет в рамках программы «Аполлон», в ходе
которого была осуществлена шестая высадка пюдей на Луну
Это была третья джеимиссия (англ „Ниш/дн) с акцентом на
научные исследования в экипаж корабля впервые вошел
ученый—профессионал, геопагХаррисон Шмитг в
распоряжении астронавтов так же, как и в ходе двух
предшествовавших зкспедици‘, бып лунный автомобилы «Лун
Команднослужебный модуль «Апаппона—П» имел позывные «
мадупь — иЧеппенджер»

«Апатит-17» (нит Ара/ю т— 1 и и последний
пилотируемый полет в рамках программы «Аполлон», в ходе
которого была осуществлена шестая высадка людей на Луну
Это была третья Мй—мисоля (амтл „Ат/шёл} с акцентом на
научные исследования В экипаж шрабпя шервыв вошёл
учёный—профессионал, теолог Харрисон Шмитт в
распоряжалии астронавтов так же, как и в ходе двух
лредлюавшаалмх экспедиций, был пуииый автомобипь‘ «Луи
Комаьшощпужебиый модуль «Апошкжа—П» имал позывные «
модуль — «Чеппеишкерх

[tesseract-ocr] Re: Need help reg pre-processing of image before ocr

2014-09-19 Thread Shree Devi Kumar

Do you still need a copy of the Sanskrit traineddata?

Shree Devi Kumar

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Fri, Aug 23, 2013 at 10:21 PM, mns_rao mns...@gmail.com wrote:

Hi,
The OCR output also depends on the traineddata file for the language of
the input image. If you have a good traineddata file for Sanskrit, you can
use FreeOCR 4.2 (http://www.paperfile.net/): in the settings, open the
language folder and paste the file there. FreeOCR 4.2 can OCR an entire
PDF book (loaded via 'open PDF') in one click with 'OCR all pages'.
Try the original book first and, if the result is not satisfactory, convert
the cleaned images into a PDF book again.
I also need a Sanskrit traineddata file, if you can spare it.
Wishing success,
MNS Rao

On Friday, 23 August 2013 18:38:44 UTC+5:30, shree wrote:

I want to OCR a Sanskrit book that is available as a PDF.

I used GSview to save all pages as PNG files, then used Scan Tailor to
deskew the images, which saved them as TIFFs. Then I used IrfanView to
apply blur and median filters, as the text is very grainy in the original,
and also resized the pages to a smaller size.

Images pre-processed this way give better results than the originals.

I would like to know if there is a simpler or better method to pre-process
the images. The PDF is 500+ pages.
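
To make the median filter step mentioned above concrete, here is a minimal pure-Python sketch of a 3x3 median filter on a grayscale image stored as a list of rows. It is not what IrfanView does internally; the function name and the tiny test image are assumptions for illustration only. For a 500+ page book a real imaging library or a batch tool would be far faster.

```python
from statistics import median_low


def median_filter_3x3(pixels):
    """Apply a 3x3 median filter to a 2D list of grayscale values.

    Border pixels use only the neighbours that exist, so the output has
    the same dimensions as the input.
    """
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                pixels[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            # median_low always returns a value that occurs in the window.
            row.append(median_low(window))
        result.append(row)
    return result


# Hypothetical 5x5 white page (255) with one black speck (0) in the middle.
page = [[255] * 5 for _ in range(5)]
page[2][2] = 0
cleaned = median_filter_3x3(page)
print(cleaned[2][2])
```

An isolated speck is outvoted by its neighbours and disappears, which is why this kind of filter helps with grainy scans. Thin strokes can be eroded the same way, so the result should be checked on a sample page first.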

I am attaching a single page from the pdf and the processed image file.

Thanks,
Shree


[tesseract-ocr] Re: How to get paragraph wise text in Tesseract ?

2014-09-19 Thread Satya Swaroop

Yes, but it did not solve my issue.

On Thursday, September 18, 2014 10:21:39 PM UTC+5:30, Albrecht Hilker wrote:

 Did you try

 SetPageSegMode(PSM_AUTO)
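
One possible alternative for paragraph-wise text, sketched here under assumptions: if the hOCR output is available, paragraphs can be collected from the elements marked with the class ocr_par. The sample markup below is hand-written and simplified, and the class and function names are made up for this example; real hOCR output carries more attributes (bounding boxes, ids) that this sketch ignores.

```python
from html.parser import HTMLParser


class HocrParagraphs(HTMLParser):
    """Collect the text of every <p class='ocr_par'> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._chunks = None  # None while outside an ocr_par element

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "p" and "ocr_par" in classes:
            self._chunks = []

    def handle_data(self, data):
        if self._chunks is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._chunks is not None:
            # Collapse line breaks and indentation into single spaces.
            text = " ".join("".join(self._chunks).split())
            if text:
                self.paragraphs.append(text)
            self._chunks = None


def paragraphs_from_hocr(hocr):
    parser = HocrParagraphs()
    parser.feed(hocr)
    parser.close()
    return parser.paragraphs


# Simplified, hand-written sample; not actual Tesseract output.
SAMPLE = """
<div class='ocr_page'>
 <p class='ocr_par'>
  <span class='ocr_line'><span class='ocrx_word'>First</span> <span class='ocrx_word'>paragraph.</span></span>
 </p>
 <p class='ocr_par'>
  <span class='ocr_line'><span class='ocrx_word'>Second</span> <span class='ocrx_word'>one.</span></span>
 </p>
</div>
"""

for paragraph in paragraphs_from_hocr(SAMPLE):
    print(paragraph)
```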



[tesseract-ocr] Re: Tesseract recognizes the characters irrespective of the lines

2014-09-19 Thread Satya Swaroop

I am also facing the same problem. Please post your answer once you find it.

Thanks in advance


On Tuesday, September 9, 2014 6:58:15 PM UTC+5:30, Dineshkumar wrote:

 What steps will reproduce the problem?
 1. Run the Tesseract OCR in Java for the attached image 
 2. Save the OCR result in a text file
 3. Check the order of the output text file with the attached image.
 What is the expected output? What do you see instead?
 Expected output -- Expected the result with words in the horizontal left to 
 right order.

 Actual output   -- Showing words randomly irrespective of the line order.
 What version of the product are you using? On what operating system?
 Tesseract 3.01 and Windows 7 
 Please provide any additional information below.
 The input and the expected and actual output are attached for reference.
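
If word bounding boxes are available, one workaround for out-of-order words is to regroup them into lines after recognition. The sketch below assumes each word comes as a (text, left, top, height) tuple in pixels; that layout, the function name, and the tolerance value are assumptions for this example, not part of any Tesseract API. It only handles single-column pages.

```python
def lines_in_reading_order(words, tolerance=10):
    """Group word boxes into text lines, top to bottom and left to right.

    words: iterable of (text, left, top, height) tuples in pixels.
    Two words share a line when their vertical centres differ from the
    line's first word by at most `tolerance` pixels.
    """

    def centre(word):
        return word[2] + word[3] / 2

    lines = []
    for word in sorted(words, key=centre):
        if lines and abs(centre(word) - lines[-1]["centre"]) <= tolerance:
            lines[-1]["words"].append(word)
        else:
            lines.append({"centre": centre(word), "words": [word]})

    return [
        " ".join(w[0] for w in sorted(line["words"], key=lambda w: w[1]))
        for line in lines
    ]


# Hypothetical boxes, deliberately listed out of order.
boxes = [
    ("world", 60, 12, 20),
    ("Second", 0, 50, 20),
    ("Hello", 0, 10, 20),
    ("line", 70, 48, 20),
]
for text_line in lines_in_reading_order(boxes):
    print(text_line)
```

The tolerance has to be tuned to the line height of the scanned document; too large a value merges neighbouring lines, too small a value splits a slightly skewed line.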




Re: [tesseract-ocr] Modification of background image allowed in PDF output?

2014-09-19 Thread zdenko podobny

This is a known issue. Try the current code from the git repository; it
should be fixed there.

Zdenko

On Fri, Sep 19, 2014 at 2:38 PM, Frank Siegert frank.sieg...@googlemail.com
wrote:

Dear all,

I have been testing tesseract to embed OCR in scanned PDF documents, and
it works phenomenally well in recognizing the text.

Now I noticed one slightly disturbing issue just by chance when comparing
the original input image and the PDF file: A number of straight lines that
are present in the input image have disappeared completely in the PDF (some
of them are horizontal rules, others are lines in a logo). Since I wanted to
use tesseract to produce completely unmodified documents with only the OCR
text layer added, this would be a problem for me. I have uploaded a test
image for this to http://cern.ch/fsiegert/tmp/tesseract-test.tif and here
is the command I used on it:

$ tesseract -l deu tesseract-test.tif tesseract-test pdf
Tesseract Open Source OCR Engine v3.03 with Leptonica
OSD: Weak margin (6.96) for 162 blob text block, but using orientation anyway: 1
$ tesseract --version
tesseract 3.03
leptonica-1.71
libgif 5.1.0 : libjpeg 8d : libpng 1.6.12 : libtiff 4.0.3 : zlib 1.2.8 : libwebp 0.4.1

This results in http://cern.ch/fsiegert/tmp/tesseract-test.pdf, which is
missing the straight horizontal lines and the ones in the logo. Is this
line-removal done on purpose and can it be disabled?

Cheers,
Frank

PS: I have removed much more text from the document for privacy reasons,
but the same happens when the document is complete with text.


Re: [tesseract-ocr] Modification of background image allowed in PDF output?

2014-09-19 Thread Frank Siegert

Dear Zdenko,

Thanks for the quick reply!

Does that mean in general, i.e. except for this bug, that I can by
construction assume the image will remain unmodified and only a text layer
added?

Cheers,
Frank

On Friday, September 19, 2014 2:54:52 PM UTC+2, zdenop wrote:

This is known issue - try current code from git repository. It should be
fixed.

Zdenko

On Fri, Sep 19, 2014 at 2:38 PM, Frank Siegert frank@googlemail.com wrote:

Dear all,

I have been testing tesseract to embed OCR in scanned PDF documents, and
it works phenomenally well in recognizing the text.

This results in http://cern.ch/fsiegert/tmp/tesseract-test.pdf, which is
missing the straight horizontal lines and the ones in the logo. Is this
line-removal done on purpose and can it be disabled?

Cheers,
Frank

PS: I have removed much more text from the document for privacy reasons,
but the same happens when the document is complete with text.


Re: [tesseract-ocr] Modification of background image allowed in PDF output?

2014-09-19 Thread zdenko podobny

Well, yes and no ;-)
Yes: the image content should not change. No: you should expect that the
PDF renderer may (re)compress the input image. See the comments on
issue 1285 [1] for more details.

[1] https://code.google.com/p/tesseract-ocr/issues/detail?id=1285

Zdenko

On Fri, Sep 19, 2014 at 3:14 PM, Frank Siegert frank.sieg...@googlemail.com
wrote:

Dear Zdenko,

Thanks for the quick reply!

Does that mean in general, i.e. except for this bug, that I can by
construction assume the image will remain unmodified and only a text layer
added?

Cheers,
Frank

On Friday, September 19, 2014 2:54:52 PM UTC+2, zdenop wrote:

This is known issue - try current code from git repository. It should be
fixed.

Zdenko

On Fri, Sep 19, 2014 at 2:38 PM, Frank Siegert frank@googlemail.com
wrote:

Dear all,

I have been testing tesseract to embed OCR in scanned PDF documents, and
it works phenomenally well in recognizing the text.

Now I noticed one slightly disturbing issue just by chance when
comparing the original input image and the PDF file: A number of straight
lines that are present in the input image have disappeared completely in
the PDF (some of them are horizontal rules, others are lines in a logo).
Since I wanted to use tesseract to produce completely unmodified documents
with only the OCR text layer added, this would be a problem for me. I have
uploaded a test image for this to
http://cern.ch/fsiegert/tmp/tesseract-test.tif and here is the command I
used on it:

This results in http://cern.ch/fsiegert/tmp/tesseract-test.pdf, which
is missing the straight horizontal lines and the ones in the logo. Is this
line-removal done on purpose and can it be disabled?

Cheers,
Frank

PS: I have removed much more text from the document for privacy reasons,
but the same happens when the document is complete with text.


[tesseract-ocr] version 3.04

2014-09-19 Thread Rick Leir

Ubuntu 14.04 has tess 3.03 and lept 1.70.

I compiled tess 3.04 and lept 1.71, and installed them 
   (and ran ldconfig so the new libraries would get used).

Is it OK to use the old tessdata from 3.03 that was installed from the
Ubuntu package? I start tess with
$ TESSDATA_PREFIX=/usr/share/tesseract-ocr/ tesseract ofilename ifilename quiet hocr

It seems to work fine, but maybe the training data needs to be freshened.
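
For anyone scripting such a call, the sketch below builds the same kind of invocation from Python, with TESSDATA_PREFIX set only for that one process. The helper name and the file names page.tif and page are hypothetical, and the argument order (input image first, then output base name, then config names) is assumed from the usual command-line usage. Nothing is executed here; the command is only assembled and printed.

```python
import os
import shlex


def build_tesseract_call(image_path, output_base, configs=(), tessdata_prefix=None):
    """Return (argv, env) for running tesseract with a chosen data directory."""
    argv = ["tesseract", image_path, output_base, *configs]
    env = dict(os.environ)  # copy, so the caller's environment stays untouched
    if tessdata_prefix is not None:
        env["TESSDATA_PREFIX"] = tessdata_prefix
    return argv, env


# Hypothetical file names; the prefix is the one quoted in the message above.
argv, env = build_tesseract_call(
    "page.tif",
    "page",
    configs=["quiet", "hocr"],
    tessdata_prefix="/usr/share/tesseract-ocr/",
)
print("TESSDATA_PREFIX=" + env["TESSDATA_PREFIX"], shlex.join(argv))

# To actually run it (requires tesseract on PATH):
#   import subprocess
#   subprocess.run(argv, env=env, check=True)
```

Passing the variable through env rather than exporting it globally makes it easier to switch between the packaged data directory and a freshly built one when comparing results.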

Thanks
Rick


Re: [tesseract-ocr] version 3.04

2014-09-19 Thread zdenko podobny

There is no tesseract 3.04, so you cannot install it.
Your question indicates that you do not understand the consequences of your
actions, so I strongly suggest that you revert to the last stable release,
which is 3.02.02.

Zdenko

On Fri, Sep 19, 2014 at 8:31 PM, Rick Leir rich...@c7a.ca wrote:

Ubuntu 14.04 has tess 3.03 and lept 1.70.

I compiled tess 3.04 and lept 1.71, and installed them
(and ran ldconfig so the new libraries would get used).

Is it ok to use the old tessdata from 3.03 that was installed from the
Ubuntu package? I start tess with
$ TESSDATA_PREFIX=/usr/share/tesseract-ocr/ tesseract ofilename ifilename quiet hocr

It seems to work fine, but maybe the training data needs to be freshened.

Thanks
Rick

