Re: [tesseract-ocr] lines dissappear in resulting file

2015-01-09 Thread ShreeDevi Kumar
you should *uninstall the old version fully* and then build the version from git. It is possibly referring to some older libraries. Also, this needs leptonica 1.71. Not sure if the documentation mentions it or not. ShreeDevi भजन -

Re: [tesseract-ocr] lines dissappear in resulting file

2015-01-09 Thread ShreeDevi Kumar
please see https://code.google.com/p/tesseract-ocr/issues/detail?id=1278 ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Fri, Jan 9, 2015 at 5:44 PM, ShreeDevi Kumar shreesh...@gmail.com wrote: you should *uninstall

Re: [tesseract-ocr] lines dissappear in resulting file

2015-01-08 Thread ShreeDevi Kumar
I am using the git version -- output and messages attached. pdf seems to have all the lines. User@HP ~/tesseract-ocr/testing $ tesseract 5.tif 5 pdf Tesseract Open Source OCR Engine v3.04.00 with Leptonica Page 1 OSD: Weak margin (5.78), horiz textlines, not CJK: Don't rotate. Page 2 Too few

Re: [tesseract-ocr] lines dissappear in resulting file

2015-01-08 Thread ShreeDevi Kumar
I don't think that's the supposed behavior. What version of tesseract are you using? Please post a sample image for testing? ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Thu, Jan 8, 2015 at 8:00 PM, C.

Re: [tesseract-ocr] Very wrong output Tessnet2 + Tesseract

2015-01-03 Thread ShreeDevi Kumar
see http://stackoverflow.com/questions/15067651/cannot-find-a-way-to-make-tessnet2-work tessnet2 is .NET wrapper for Tesseract 2.04 Try newer versions - say from https://github.com/charlesw/tesseract ShreeDevi भजन - कीर्तन - आरती @

Re: [tesseract-ocr] Odd behavior when trying to force a box to split

2015-01-01 Thread ShreeDevi Kumar
I think you need to deskew/dewarp the lines, increase brightness, get the images at 300dpi and try. I tested using your images with vietocr (4.0 beta) with the following output ... -- East 133rd Street, cast from Cypress Ave. In the background is the United Electric Light and

Re: [tesseract-ocr] tesseract 3 pdf error

2014-12-13 Thread ShreeDevi Kumar
Which version of source have you used? Latest version is available from https://code.google.com/p/tesseract-ocr/source/checkout You need the pdf config files in tessdata directory. See https://code.google.com/p/tesseract-ocr/source/browse/tessdata You also need to make sure that tessdata_prefix

Re: [tesseract-ocr] Re: Searchable PDF output with oversized font

2014-11-25 Thread ShreeDevi Kumar
Hi Chris, I opened the pdfs in Adobe Reader as well as Foxit Reader on Windows7, and the page flickers with large size text but then seems to display normally - zoom 100% also seems to be regular output only. Tesseract now has a 'pdf' option, so you don't need to do 'hocrpdf'. Try the following:

Re: [tesseract-ocr] Re: Searchable PDF output with oversized font

2014-11-23 Thread ShreeDevi Kumar
Have you tried with version compiled from latest source on git? If you post a couple of sample images I can give a try and let you know what results I get. ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Sun, Nov 23,

Re: [tesseract-ocr] Training data gets worse as I add characters

2014-11-21 Thread ShreeDevi Kumar
Hi, Have you added the fonts to font-properties file? Try removing the 'narrow' font from your training set. Test with just one or two similar fonts and see if results are better. ShreeDevi भजन - कीर्तन - आरती @

Re: [tesseract-ocr] Covering ASCII Extended range.

2014-11-21 Thread ShreeDevi Kumar
. On Wed, Nov 19, 2014 at 7:47 PM, ShreeDevi Kumar shreesh...@gmail.com wrote: Training 2 files ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Thu, Nov 20, 2014 at 9:15 AM, ShreeDevi Kumar shreesh...@gmail.com

Re: [tesseract-ocr] Re: jTessBoxEditor 0.6 Beta release

2014-11-20 Thread ShreeDevi Kumar
I have not used Serak - but the issues page there indicates problems with RTL languages - see https://code.google.com/p/serak-tesseract-trainer/issues/detail?id=6 Why are you not using jTessBoxEditor's trainer or the command line programs? I think the binaries are bundled with JTess...

[tesseract-ocr] is it possible to use the latest source from git to train Arabic?

2014-11-20 Thread ShreeDevi Kumar
here. Question: am I giving the wrong file in the path in Tesseract executable and Training data, i.e. the ara box file? Or what goes wrong? Note: I have put no data in the words_list, frequent_words, font_properties files. On 20 November 2014 17:32, ShreeDevi Kumar shreesh...@gmail.com wrote: I have

Re: [tesseract-ocr] Configure for single character recognition

2014-11-15 Thread ShreeDevi Kumar
take a look at hocr output and tsv option from https://code.google.com/r/email-hocr-tsv/ ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Sat, Nov 15, 2014 at 3:39 PM, Simon Støvring simonstoevr...@gmail.com wrote: I

Re: [tesseract-ocr] Marathi OCR

2014-11-14 Thread ShreeDevi Kumar
Amarjeet, Glad that you are getting 70-80% correct OCR for Marathi using the Konkani traineddata I posted. The Hindi traineddata was trained with 'cube' method by Google but that is not available to us. The training can be improved with better training text or font similar to the one being

Re: [tesseract-ocr] Configure for single character recognition

2014-11-14 Thread ShreeDevi Kumar
Have you tried with the existing english traineddata? I get good recognition with your 'prepared-image'? If that is the kind of image you need to OCR, you could do that with psm 6 and then split each letter separately? ShreeDevi भजन -

Re: [tesseract-ocr] Reading Device labels to get model number

2014-11-13 Thread ShreeDevi Kumar
Straighten the image before sending to tesseract. You can use scantailor or unpaper. Imagemagick may also have an option, you'll have to look. See attached images - output from scantailor - and then OCRed using Vietocr (gui frontend to Tesseract) MODEL NAME 7 MOORE RF28HMEDBSR ml.“ | mt

Re: [tesseract-ocr] What are the possible output file extensions?

2014-11-13 Thread ShreeDevi Kumar
.txt .pdf .hocr pdf and hocr can be passed as CONFIG file options when using tesseract from commandline and txt output is created automatically (in both cases, I think) This is with the latest version of tesseract from git. ShreeDevi

Re: [tesseract-ocr] Covering ASCII Extended range.

2014-11-13 Thread ShreeDevi Kumar
asc traineddata does not have a wordlist or dictionary, so using eng will help with that. Also, I just trained using a few fonts that support the whole range. If you train with the font you are using, you will get better results. You can use 'combine_tessdata' command with the -u (unpack) option
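The unpack step mentioned in this message can be sketched as a small Python helper that only builds the command line. The file names and the output prefix below are hypothetical; this is an illustration under those assumptions, not the poster's exact procedure.

```python
import shlex


def unpack_command(traineddata, out_prefix):
    """Build the argv for 'combine_tessdata -u <traineddata> <prefix>'.

    Components are written as <prefix><component>, so the prefix
    conventionally ends with a dot (for example 'unpacked/asc.').
    """
    if not out_prefix.endswith("."):
        out_prefix += "."
    return ["combine_tessdata", "-u", traineddata, out_prefix]


# Hypothetical paths, for illustration only.
cmd = unpack_command("tessdata/asc.traineddata", "unpacked/asc")
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run once combine_tessdata is on the PATH.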

Re: [tesseract-ocr] Exception in thread main java.lang.UnsatisfiedLinkError: liblept.so.4: Cannot load Shared-Object

2014-11-12 Thread ShreeDevi Kumar
You need leptonica 1.71 for the current version of tesseract. liblept.so.4 ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Wed, Nov 12, 2014 at 5:05 PM, Patrick Vöhrs voe...@wesoma-consulting.com wrote: Hi at all,

Re: [tesseract-ocr] Exception in thread main java.lang.UnsatisfiedLinkError: liblept.so.4: Cannot load Shared-Object

2014-11-12 Thread ShreeDevi Kumar
Have you seen http://tess4j.sourceforge.net/ - A Java JNA wrapper for Tesseract OCR API. ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Wed, Nov 12, 2014 at 6:18 PM, ShreeDevi Kumar shreesh...@gmail.com wrote: You

Re: [tesseract-ocr] Re: Train Tesseract to Only Find a Single 17 Character Word

2014-11-12 Thread ShreeDevi Kumar
]; On Wed, Nov 12, 2014 at 12:30 AM, ShreeDevi Kumar shreesh...@gmail.com wrote: Are you able to pass a configuration variable with iOS CocoaPod ? *-c configvar=value* Set value for control parameter. Multiple -c arguments are allowed. *configfile* The name of a config to use. A config

Re: [tesseract-ocr] Re: Train Tesseract to Only Find a Single 17 Character Word

2014-11-12 Thread ShreeDevi Kumar
, ShreeDevi Kumar shreesh...@gmail.com wrote: bazaar is nothing but a config file which sets values for a set of config variables, please see https://code.google.com/p/tesseract-ocr/source/browse/tessdata/configs/bazaar So, if patterns are helpful, you can use that as a config. ShreeDevi

Re: [tesseract-ocr] Covering ASCII Extended range.

2014-11-12 Thread ShreeDevi Kumar
You can look at the unicharset of the traineddata to see the coverage. try with eng+deu+iast iast is a traineddata that I generated for sanskrit transliteration in roman/latin script. https://code.google.com/r/shreeshrii-langdata/source/browse/iast.unicharset?name=iast

Re: [tesseract-ocr] Re: jTessBoxEditor - Tesseract box editor trainer

2014-11-11 Thread ShreeDevi Kumar
JTessBoxEditor has three tabs Use *Tiff/Box Generator* to generate tiff and box files from a given text file for the chosen font The Box files created by Box/Tiff Generator are based on the rendering of the text in the chosen font and will be accurate - however they may still get errors 'blob

Re: [tesseract-ocr] Re: 6od instead of God

2014-11-11 Thread ShreeDevi Kumar
Please attach a copy of the image so that I can try. ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Tue, Nov 11, 2014 at 9:43 PM, misonis...@gmail.com wrote: I was in PSM_SINGLE_LINE mode indeed, because my text is

Re: [tesseract-ocr] Train Tesseract to Only Find a Single 17 Character Word

2014-11-11 Thread ShreeDevi Kumar
Have you tested with the English traineddata from the git tessdata repo? Please see https://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html try with these, /path/to/eng.user-patterns: 1-\d\d\d-GOOG-411 www.\n\\\*.com I haven't tried this personally though ShreeDevi
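A minimal Python sketch of preparing the user-patterns file quoted above. The helper name and the temporary directory are assumptions made for illustration; in practice the file would sit next to eng.traineddata in the real tessdata directory.

```python
import os
import tempfile


def write_user_patterns(tessdata_dir, lang, patterns):
    """Write one pattern per line to <tessdata_dir>/<lang>.user-patterns."""
    path = os.path.join(tessdata_dir, f"{lang}.user-patterns")
    with open(path, "w", encoding="utf-8") as handle:
        for pattern in patterns:
            handle.write(pattern + "\n")
    return path


# The two patterns quoted in the message; raw strings keep the backslashes.
PATTERNS = [r"1-\d\d\d-GOOG-411", r"www.\n\\\*.com"]

# A temporary directory stands in for the real tessdata directory here.
demo_dir = tempfile.mkdtemp()
patterns_path = write_user_patterns(demo_dir, "eng", PATTERNS)
print(patterns_path)
```

As far as the man page describes it, the file is only consulted when a config file enables it, so this sketch covers the file preparation and nothing more.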

Re: [tesseract-ocr] Re: jTessBoxEditor - Tesseract box editor trainer

2014-11-11 Thread ShreeDevi Kumar
You don't need to train in order to extract text. Have you tried with the english traineddata .. available from https://code.google.com/p/tesseract-ocr/source/browse/?repo=tessdata ShreeDevi भजन - कीर्तन - आरती @

Re: [tesseract-ocr] Train Tesseract to Only Find a Single 17 Character Word

2014-11-11 Thread ShreeDevi Kumar
also see https://groups.google.com/forum/#!topic/tesseract-ocr/et7bS5QRf2o ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Tue, Nov 11, 2014 at 11:02 PM, ShreeDevi Kumar shreesh...@gmail.com wrote: Have you tested

Re: [tesseract-ocr] Re: 6od instead of God

2014-11-11 Thread ShreeDevi Kumar
You need to pre-process the image so that G shows up correctly. In the attached image G looks like a 6 as it is connected. If that is the shape of G in the font and you need to OCR it, you may either need to retrain or post-process the text. You could also try with a newer version of program.

Re: [tesseract-ocr] Re: 6od instead of God

2014-11-11 Thread ShreeDevi Kumar
I checked with vietocr beta4, which uses newer version of tesseract - it recognizes your tiff correctly. ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Wed, Nov 12, 2014 at 8:12 AM, ShreeDevi Kumar shreesh

Re: [tesseract-ocr] Re: Train Tesseract to Only Find a Single 17 Character Word

2014-11-11 Thread ShreeDevi Kumar
, as the final version of what I'm using will be using an iOS CocoaPod that does not support the bazaar functionality of Tesseract. On Tue, Nov 11, 2014 at 8:51 PM, ShreeDevi Kumar shreesh...@gmail.com wrote: On Wed, Nov 12, 2014 at 2:13 AM, ste...@fortyau.com wrote: The user-patterns looks

Re: [tesseract-ocr] Re: jTessBoxEditor 0.6 Beta release

2014-11-10 Thread ShreeDevi Kumar
Look under jtessboxeditor/samples/vie folder and create similar files for your language ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Mon, Nov 10, 2014 at 1:10 PM, iram akbar iramakb...@gmail.com wrote: Quan, i

Re: [tesseract-ocr] Training Tesseract Can't Find Files

2014-11-10 Thread ShreeDevi Kumar
What method are you using for training? Which version of tesseract? What platform? Please see instructions on https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 The following shell script will be useful, if using the latest source from git.

Re: [tesseract-ocr] Support Language

2014-11-08 Thread ShreeDevi Kumar
See https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!topic/tesseract-dev/8e0F2cK2YzU for Plans for 3.04 release. For Training Instructions, please see https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3

Re: [tesseract-ocr] Support Language

2014-11-07 Thread ShreeDevi Kumar
Please see https://code.google.com/p/tesseract-ocr/source/browse/?repo=langdata#git%2Fkat Language codes: ISO 639-1 (http://en.wikipedia.org/wiki/ISO_639-1): ka; ISO 639-2 (http://en.wikipedia.org/wiki/ISO_639-2): geo (http://www.sil.org/iso639-3/documentation.asp?id=geo) (B), kat

Re: [tesseract-ocr] Re: Tesseract 3.02.02 Released

2014-11-07 Thread ShreeDevi Kumar
https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Fri, Nov 7, 2014 at 4:26 PM, iram akbar iramakb...@gmail.com wrote: Hi, i want to make my own tessdata

Re: [tesseract-ocr] Re: Tesseract 3.02.02 Released

2014-11-07 Thread ShreeDevi Kumar
Also see https://drive.google.com/folderview?id=0B7l10Bj_LprhQnpSRkpGMGV2eE0&usp=sharing tutorial files for overview ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Fri, Nov 7, 2014 at 5:04 PM, ShreeDevi Kumar shreesh

Re: [tesseract-ocr] Support Georgian Language

2014-11-07 Thread ShreeDevi Kumar
CC:ing Ray and Dev group That language data is part of the update done by Ray Smith on August 12. Ray is planning an update to language data and traineddata soon, so if you have suggestions for improvement, please file an issue and provide more details, samples of each script, etc.. ShreeDevi

Re: [tesseract-ocr] Re: jTessBoxEditor 0.6 Beta release

2014-11-06 Thread ShreeDevi Kumar
Please also change the FONT under TRAINER tab to Arabic. ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Thu, Nov 6, 2014 at 2:49 PM, iram akbar iramakb...@gmail.com wrote: i have downloaded the latest version 1.1

Re: [tesseract-ocr] Reducing the generated PDF size / compression PDF

2014-11-06 Thread ShreeDevi Kumar
You could also test with gswin32c -q -dNOPAUSE -dBATCH -sDEVICE=tiffgray -sCompression=lzw -r300 ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Thu, Nov 6, 2014 at 2:13 PM, Sébastien Cuendet
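The quoted Ghostscript options can be assembled into a complete command as in the Python sketch below. The input file, the output pattern, and the -sOutputFile argument are assumptions added for illustration; they do not appear in the quoted command.

```python
import shlex


def gs_tiff_command(input_file, output_pattern, gs_binary="gswin32c", dpi=300):
    """Build a Ghostscript argv that renders input_file to LZW-compressed gray TIFF."""
    return [
        gs_binary,
        "-q", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=tiffgray",
        "-sCompression=lzw",
        f"-r{dpi}",
        f"-sOutputFile={output_pattern}",  # assumption: not in the quoted command
        input_file,                        # assumption: not in the quoted command
    ]


# Hypothetical file names; %03d numbers the output pages.
cmd = gs_tiff_command("scan.pdf", "page-%03d.tif")
print(shlex.join(cmd))
```

On non-Windows systems the binary is usually called gs, which is why the name is a parameter.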

Re: [tesseract-ocr] Re: jTessBoxEditor 0.6 Beta release

2014-11-06 Thread ShreeDevi Kumar
Click on the 'generate' box - with some Devanagari fonts I have found that text does not display but the tiff/box are generated. Maybe same for the Arabic font you are using. Give it a try. You can also try to copy and paste the text, sometimes that works. ShreeDevi

Re: [tesseract-ocr] Re: jTessBoxEditor 0.6 Beta release

2014-11-06 Thread ShreeDevi Kumar
I think you are using the wrong tools ... If you need to convert a jpg to tif, use an image editor such as ImageMagick or IrfanView. If you need to OCR the image, tesseract accepts jpg as input as well as tif. There already is Arabic traineddata for tesseract - see

Re: [tesseract-ocr] Reading dot matrix characters

2014-11-05 Thread ShreeDevi Kumar
I had asked to try vietocr because it is using a newer svn version for the java 4.0beta and I find it easy to test under windows with the gui, as I can change the image filter settings in it. You will have to choose the tools based on your platform and other requirements. You could use

Re: [tesseract-ocr] How to run make training for Repo installed Tesseract 3.03

2014-11-05 Thread ShreeDevi Kumar
Did you install the latest version from http://packages.ubuntu.com/utopic/tesseract-ocr ? If so, it should have the training tools. Try 'which text2image' to see if it is installed. ShreeDevi भजन - कीर्तन - आरती @
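The same check can be scripted. The Python sketch below uses the standard library's shutil.which, which mirrors the shell's which lookup; the helper name find_tools is hypothetical.

```python
import shutil


def find_tools(names=("text2image",)):
    """Map each tool name to its full path on PATH, or None if it is missing."""
    return {name: shutil.which(name) for name in names}


for tool, location in find_tools().items():
    print(f"{tool}: {location or 'not found on PATH'}")
```

Passing more names, for example find_tools(["text2image", "combine_tessdata"]), checks several training tools in one call.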

Re: [tesseract-ocr] Reading dot matrix characters

2014-11-05 Thread ShreeDevi Kumar
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Wed, Nov 5, 2014 at 4:57 PM, ShreeDevi Kumar shreesh...@gmail.com wrote: I had asked to try vietocr because it is using a newer svn version for the java 4.0beta and I find it easy to test under windows with the gui, as I can

Re: [tesseract-ocr] Re: Adding new language to Tesseract?

2014-11-03 Thread ShreeDevi Kumar
There already is language data for srp - please see https://code.google.com/p/tesseract-ocr/source/browse/srp/?repo=langdata and https://code.google.com/p/tesseract-ocr/source/browse/srp.traineddata?repo=tessdata Ray Smith, the lead developer of tesseract at Google is planning to release

Re: [tesseract-ocr] Re: Adding new language to Tesseract?

2014-11-03 Thread ShreeDevi Kumar
Thanks for clarifying and giving more details. I am cc:ing this email to the tesseract developers group and Ray for answer to your question how to submit this file to Tesseract's repository?. Meanwhile, I suggest that you add an 'issue' and attach the traineddata. Thanks! ShreeDevi

[tesseract-ocr] Re: Contribution : Serbian Cyrillic traineddata file

2014-11-03 Thread ShreeDevi Kumar
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Tue, Nov 4, 2014 at 7:35 AM, ShreeDevi Kumar shreesh...@gmail.com wrote: Thanks for clarifying and giving more details. I am cc:ing this email to the tesseract developers group and Ray

Re: [tesseract-ocr] default mode PSM

2014-11-01 Thread ShreeDevi Kumar
http://manpages.ubuntu.com/manpages/precise/man1/tesseract.1.html *tesseract* *imagename* *outbase* [*-l* *lang*] [*-psm* *N*] [*configfile* ...] ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Sat, Nov 1, 2014 at
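The synopsis above maps directly onto an argument list. The Python sketch below builds one in the same order as the man page; the function name and the file names are hypothetical. The single-dash -psm spelling is the one quoted here for the 3.x releases; later releases spell it --psm.

```python
import shlex


def tesseract_command(imagename, outbase, lang=None, psm=None, configfiles=()):
    """Build argv following: tesseract imagename outbase [-l lang] [-psm N] [configfile ...]"""
    argv = ["tesseract", imagename, outbase]
    if lang is not None:
        argv += ["-l", lang]
    if psm is not None:
        argv += ["-psm", str(psm)]
    argv += list(configfiles)  # config names such as 'pdf' or 'hocr' go last
    return argv


# Hypothetical file names, for illustration only.
print(shlex.join(tesseract_command("page.tif", "page", lang="deu", psm=6)))
print(shlex.join(tesseract_command("page.tif", "page", configfiles=["hocr"])))
```

Leaving psm unset omits the option entirely, so tesseract falls back to its own default page segmentation mode.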

Re: [tesseract-ocr] default mode PSM

2014-11-01 Thread ShreeDevi Kumar
Updated version of man page is at https://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Sat, Nov 1, 2014 at 4:19 PM, ShreeDevi Kumar shreesh...@gmail.com

Re: [tesseract-ocr] Re: any chance to get this .tiff converted to text?

2014-10-31 Thread ShreeDevi Kumar
In VietOCR's image menu, check 'screenshot mode' Use the filters submenu to experiment with other settings to improve your image. Look under properties for the dpi, convert your input images to 300dpi as they are currently low res (72dpi or so). experiment :-) ShreeDevi

Re: [tesseract-ocr] Strange regocnition

2014-10-31 Thread ShreeDevi Kumar
Change the image to 300 dpi. Try vietocr in screenshot mode, and try with the Vietnamese traineddata. With commandline tesseract, use the 'digits' config file as a parameter. Recognizing only numbers is actually answered on the tesseract FAQ http://code.google.com/p/tesseract-ocr/wiki/FAQ

Re: [tesseract-ocr] Re: any chance to get this .tiff converted to text?

2014-10-30 Thread ShreeDevi Kumar
Do look at https://code.google.com/p/tesseract-ocr/wiki/ImproveQuality for pre-processing steps for your images to improve recognition regardless of the OCR you use. ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Wed,

Re: [tesseract-ocr] Re: any chance to get this .tiff converted to text?

2014-10-29 Thread ShreeDevi Kumar
Please choose german in the dropdown for language on right hand side. ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Wed, Oct 29, 2014 at 9:08 PM, boris borisri...@gmail.com wrote: Hi Shree, many thanks for your

Re: [tesseract-ocr] any chance to get this .tiff converted to text?

2014-10-28 Thread ShreeDevi Kumar
I was going to suggest the tips from https://code.google.com/p/tesseract-ocr/wiki/ImproveQuality but, just OCRing the image without any changes in VietOCR (GUI frontend for tesseract) with German traineddata gives perfect result - see image. What version are you using, on what platform, ?? I

Re: [tesseract-ocr] Reading dot matrix characters

2014-10-23 Thread ShreeDevi Kumar
Try .net wrapper with newer version of tesseract. invert the image, smoothen/blur, make greyscale ... I tried with vietocr output is 'QBCDEFGHIJKL' ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Thu, Oct 23, 2014 at

Re: [tesseract-ocr] Reading dot matrix characters

2014-10-23 Thread ShreeDevi Kumar
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Thu, Oct 23, 2014 at 12:24 PM, ShreeDevi Kumar shreesh...@gmail.com wrote: Try .net wrapper with newer version of tesseract. invert the image, smoothen/blur, make greyscale ... I tried

Re: [tesseract-ocr] how can I get better results for this

2014-10-17 Thread ShreeDevi Kumar
https://code.google.com/p/tesseract-ocr/wiki/ImproveQuality Try with the image at 300dpi or higher; resize to 300%. ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Fri, Oct 17, 2014 at 8:35 PM, Rick Leir rich...@c7a.ca

Re: [tesseract-ocr] how can I get better results for this

2014-10-17 Thread ShreeDevi Kumar
You have to experiment .. I got better results after some image processing and vietocr .. that it has bcln dooi transfer of a portzon which has been leased an. M- nan-ant.‘ 0n Mu [image: Inline image 1] ShreeDevi भजन - कीर्तन -

Re: [tesseract-ocr] Tessdata for marathi

2014-10-16 Thread ShreeDevi Kumar
Marathi traineddata should be in the next release, since there is langdata for it now in the repo. You can give a try to the traineddata file from https://code.google.com/r/shreeshrii-tessdata/source/browse?name=knn which is a start for konkani. ShreeDevi
