from:"Nick White"

Re: [tesseract-ocr] What is the state of the C and Python APIs?

2018-08-07 Thread Nick White

Hi Luke, On Mon, Aug 06, 2018 at 02:12:38PM -0700, Luke Brandl wrote: > I've been working to understand Tesseract and looking through the C and Python > API code and documentation. It looks like some of the code and documentation > are up to date, while the rest refers to 3.0.2 at least in the

Re: [tesseract-ocr] Modyfying existing traineddata

2016-02-23 Thread Nick White

Hi Devon, On Mon, Feb 22, 2016 at 10:43:33AM -0800, Devon Yoo wrote: > I have test set that only has "uppercase English alphabets" and "numbers". But > the provided eng.traineddata returns symbols and lower case alphabets > sometimes. Is there a way to modify the existing traineddata file so

Re: [tesseract-ocr] Run Tesseract on linux without shared libraries

2016-01-21 Thread Nick White

Hi Łukasz, > Is it possible to run tesseract without setting up > LD_LIBRARY_PATH? Why don't you want to just use LD_LIBRARY_PATH? I suspect, to be honest, that it would be difficult to compile the leptonica library into the tesseract executable. It would be fun and interesting (to me) to

Re: [tesseract-ocr] Using plain makefiles for fun and profit (was: Run Tesseract on linux without shared libraries)

2016-01-21 Thread Nick White

So this email prompted me to try something a little crazy, but it worked; I just built a statically linked tesseract binary :) A long time ago I wrote some plain makefiles which didn't rely on any automake / cmake stuff. The main devs weren't interested, understandably, but it was useful and

Re: [tesseract-ocr] Tesseract for Tibetan

2015-11-25 Thread Nick White

Hi Yizhen, On Tue, Nov 24, 2015 at 07:08:24PM -0800, Yizhen Hai wrote: > I am working on a volunteer project to digitize the Sutra and all related > materials, most of them in Tibetan. Sounds like a great project :) > Therefore, I wonder how I can get help to use Tesseract for Tibetan. (I am

Re: [tesseract-ocr] how to use tesstrain .sh etc in ubuntu 15.10

2015-11-10 Thread Nick White

On Tue, Nov 10, 2015 at 08:59:19AM -0800, Ryan Baumann wrote: > Thanks for this, Nick. I'm just getting around to looking into moving my Latin > training into the tesstrain.sh system and this is very helpful. Great, I was planning to do that myself with your Latin training - let me know if you

Re: [tesseract-ocr] how to use tesstrain .sh etc in ubuntu 15.10

2015-11-09 Thread Nick White

> Dear Nick, > Awaiting your valuable guidance. Kindly treat my request as SOS due to my > overaged factor of 83+yrs old. I want to enjoy the program. > With warmest regards, > sriranga(83yrs) > > On Tue, Nov 3, 2015 at 12:19 AM, Nick White <nick.wh...@du

Re: [tesseract-ocr] how to use tesstrain .sh etc in ubuntu 15.10

2015-11-02 Thread Nick White

Hi Sriranga, > I find there three files of '.sh - viz. > 1) language-specific.sh. (My lang is "kan") > 2)tesstrain.sh > 3)tesstrain_utils.sh. > Request for the valuable guidance how to use above .sh files ( step by step I plan to write up some proper documentation on how to use these scripts

[tesseract-ocr] Ground truth files

2015-10-29 Thread Nick White

Hi all, I recently finally got around to organising and releasing some (well, a lot of) ground truth files for the language I have been training for ages now, Ancient Greek. By "ground truth" I mean real page scans with the corresponding (hand-typed) correct text, which is essential to be

Re: [tesseract-ocr] Ground truth files

2015-10-29 Thread Nick White

Just a note, all the .git URLs listed below are git repositories, and there isn't a web interface to them on my server, so just clone them directly like this: git clone http://ancientgreekocr.org/mignetools.git Nick On Thu, Oct 29, 2015 at 06:23:21PM +, Nick White wrote: > Hi

Re: [tesseract-ocr] Tesseract option 4 bug

2015-10-23 Thread Nick White

Hi Alfred, On Fri, Oct 23, 2015 at 01:11:55AM -0700, Alfred Puca wrote: > I sent an attachment with the error using program from command line with psm > option-4 Thanks for that. The first thing I notice is that you're using an old version of tesseract (3.02). Can you update to the latest

Re: [tesseract-ocr] Tesseract option 4 bug

2015-10-22 Thread Nick White

Hi Alfred, On Wed, Oct 21, 2015 at 01:16:22AM -0700, Alfred Puca wrote: > I'm having problems with psm option 4 (Assume a single column of text of > variable sizes). > It seems as a bug in the application. > How is it possible to use this option? What problems are you having? Can you give an

Re: [tesseract-ocr] How to install on Shared Hosting

2015-10-22 Thread Nick White

Hi Avinash, On Wed, Oct 21, 2015 at 01:40:35PM -0700, Avinash Mishra wrote: > I dont have VPS can anybody tell me how to install it on shared hosting The instructions for installing without root should be what you need:

Re: [tesseract-ocr] Tesseract 3.04 error.

2015-09-17 Thread Nick White

On Wed, Sep 16, 2015 at 10:16:40PM +0530, ShreeDevi Kumar wrote: > If you are having trouble using it with Java, Quan maybe able to suggest a > solution. I agree, this sounds more like a Java issue to me. I don't know Java at all, but if it's treating anything that sends output to stderr as

Re: [tesseract-ocr] Re: Why would this be? -> When I reinitialize tesseract for every call in a loop it consistently runs faster by a something like .1 second per loop iteration

2015-09-17 Thread Nick White

On Fri, Sep 11, 2015 at 12:13:02AM -0700, fsbo.cons...@gmail.com wrote: > To anyone else who may run across this, it is because of the way C++ uses > scope > to optimize the code when it compiles. Things that are within the scope of the > for loop will run faster than things that have larger

Re: [tesseract-ocr] Re: Easiest way to run Tesseract from a Mac

2015-08-21 Thread Nick White

On Fri, Aug 21, 2015 at 02:13:17PM +0100, Allistair wrote: This, I think, just illustrates there is no one-size-fits-all approach. All methods should be enumerated for installing Tesseract for Mac. I disagree. Mac OS X is a homogenous enough system that we ought to be able to do it right,

[tesseract-ocr] Small update on the tools I wrote

2015-04-30 Thread Nick White

Hi all, long time since I last posted here. This is just a little update about some training related tools I wrote a while ago, the 'tesstrainingtools' collection. It has largely been superseded by the training stuff that's included in Tesseract now, but maybe someone will still find it

Re: [tesseract-ocr] Re: Tesseract for recognition the international phonetic transcription

2015-04-30 Thread Nick White

even with specific training? Tom On Wednesday, January 22, 2014 at 11:55:28 AM UTC-5, Nick White wrote: Hi Epin, On Sat, Jan 18, 2014 at 01:32:11AM -0800, Epin Dorsal wrote: I've been looking for a soft means for recognition the international phonetic transcription

Re: [tesseract-ocr] Losing accuracy when training tessearct on fonts it already is trained on

2014-08-25 Thread Nick White

On Fri, Aug 22, 2014 at 12:42:21PM -0700, Thomas Bruno wrote: Is this common when training from text2image output? APPLY_BOXES: boxfile line 5364/748 ((1488,893),(1532,6)): FAILURE! Couldn't find a matching blob FAIL! Yes, there will be some of these. Check the proportion of failing to

Re: [tesseract-ocr] Can I remove some fonts from an existing traineddata?

2014-08-21 Thread Nick White

On Wed, Aug 20, 2014 at 07:39:50PM -0700, SHEN Fei wrote: hi Nick, I'm trying to use tesseract in my mobile phone so the tessdata size is critical. Since I only care about very few fonts, it would be convenient if I could add/ remove a special font. Maybe removing some dictionary files

Re: [tesseract-ocr] Makefile:372: recipe for target 'all' failed - using current version with leptonica 1.71 on cygwin

2014-08-21 Thread Nick White

On Thu, Aug 21, 2014 at 01:41:23PM +0530, Shree Devi Kumar wrote: Hi Zdenko, ./ confusing for me :-) :-) ./ is a common idiom for unix. '.' means 'current directory', so ./ means 'in the current directory'. You have to do it to run programs in the current directory (or just do something

Re: [tesseract-ocr] tesseract trained successfully but gives:Tesseract Open Source OCR Engine v3.03 with Leptonica Segmentation fault (core dumped)

2014-08-21 Thread Nick White

Hi Dovhani, Does this happen with all images when using your training, or just one? Nick On Thu, Aug 21, 2014 at 03:03:47AM -0700, Dovhani Foneworx wrote: Hi guys, I have a problem, I have successfully trained tesseract 3.03 in Ubuntu 14.04 but when I run tesseract it is giving errors on an

Re: [tesseract-ocr] tesseract trained successfully but gives:Tesseract Open Source OCR Engine v3.03 with Leptonica Segmentation fault (core dumped)

2014-08-21 Thread Nick White

, Aug 21, 2014 at 4:03 PM, Nick White nick.wh...@durham.ac.uk wrote: Hi Dovhani, Does this happen with all images when using your training, or just one? Nick On Thu, Aug 21, 2014 at 03:03:47AM -0700, Dovhani Foneworx wrote: Hi guys, I have a problem, I have

Re: [tesseract-ocr] Re: Tesseract compilation on code blocks (gcc + mingw)

2014-08-21 Thread Nick White

On Thu, Aug 21, 2014 at 11:29:09AM -0700, shree wrote: zdenko, the current problem also seems related to strtok_r please see http://stackoverflow.com/questions/12973750/fatal-error-strtok-r-h-no-such-file-or-directory-while-compiling-tesseract-oc

Re: [tesseract-ocr] Can I remove some fonts from an existing traineddata?

2014-08-20 Thread Nick White

Hi Shen, On Wed, Aug 20, 2014 at 01:10:30AM -0700, SHEN Fei wrote: Can I remove some fonts from an existing traineddata file? For example, if I only need 2 or 3 comon fonts of default eng.traineddata, is there a way to extract them out of the original file? No, I'm afraid not, not at the

Re: [tesseract-ocr] Losing accuracy when training tessearct on fonts it already is trained on

2014-08-20 Thread Nick White

Hi Thomas, On Mon, Aug 18, 2014 at 02:17:19PM -0700, Thomas Bruno wrote: Where can I find the box/tif combo for the eng.traineddata that Tessearct 3.02 provides for download? The tif/box files used to create the eng.traineddata for 3.02 are not available, and are very unlikely to be made so,

Re: [tesseract-ocr] Best image pre-processing software

2014-08-20 Thread Nick White

Hi Chris, On Wed, Aug 20, 2014 at 11:12:50AM -0700, Chris Smeal wrote: I've been doing some research on using Tesseract for both document scans and text in scenery, and I was wondering what image processors are best? Given I have a lot of images, I cannot process each batch by hand, so I will

Re: [tesseract-ocr] Re: How to disable image pre-processing?

2014-08-13 Thread Nick White

On Wed, Aug 13, 2014 at 08:39:06AM -0700, Oliver Nicolini wrote: A little up, I can't find any doc for this topic. If anyone can help that would be fantastic. Did you read Paul's reply? Tesseract only does binarisation. If you don't want it to do that, binarise your image before passing it
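
For anyone wondering what "binarise your image before passing it" might look like in practice, here is a minimal sketch. It uses a plain fixed threshold on a greyscale image held as nested lists; the function name `binarise` and the threshold of 128 are assumptions made for illustration, and this is not the method Tesseract itself uses internally.

```python
def binarise(pixels, threshold=128):
    """Return a black/white copy of a greyscale image.

    `pixels` is a list of rows, each row a list of grey values (0-255).
    The fixed threshold is an illustrative assumption; real scans often
    need a threshold chosen per image.
    """
    return [
        [255 if value >= threshold else 0 for value in row]
        for row in pixels
    ]


page = [[12, 200, 130], [127, 128, 255]]
print(binarise(page))
```

The result contains only 0 and 255, so a later binarisation step should have nothing left to decide. In a real pipeline an image library would do the reading and writing; the nested lists just keep the sketch self-contained.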

Re: [tesseract-ocr] Error when running make - scanutils.cpp:38:14: error: typedef redefinition with different types ('long' vs '__darwin_off_t' (aka 'long long'))

2014-08-12 Thread Nick White

On Tue, Aug 12, 2014 at 12:58:23PM +0530, Shree Devi Kumar wrote: On Tue, Aug 12, 2014 at 4:31 AM, testing1234 cory.hix...@gmail.com wrote: Note.. Step 5 above the last command should be sudo make install-langs Nick, it maybe helpful to add/update instructions in wiki. Cory

Re: [tesseract-ocr] Outreach from the Wikisource community

2014-08-12 Thread Nick White

Dear Wikisourcerers, It's good to hear from you. Wikisource is awesome, as far as I am concerned. One of the most serious issues was raised by the Belarusian community which uses 2 different scripts with no commercial OCR support. This means that the volunteers have to type each word

Re: [tesseract-ocr] Re: Trying to understand custom dictionaries

2014-08-12 Thread Nick White

On Thu, Jul 24, 2014 at 05:53:56AM -0700, Victoria A. wrote: From my experience, seeing that Tesseract's English training data can recognize words that are NOT contained in the dictionary, I suppose Tesseract only uses the custom dictionary for hints instead of only knowing the words in the

Re: [tesseract-ocr] Passing RegEx to Zone Scans

2014-08-12 Thread Nick White

Hi David, You're right, that would be useful. Tesseract has a basic version of that, called patterns; search the manpage for a bit of information on them. However at present they can't be assigned per region, only as possible patterns for the whole OCR job. Also they aren't restrictive, but

Re: [tesseract-ocr] I compiled and installed tesseract from the source on CentOS. I kept both 3.01 and 3.02 versions. I use environment path stored in bash file to point to the version in use.

2014-08-06 Thread Nick White

On Tue, Jul 22, 2014 at 11:48:21PM +0200, zdenko podobny wrote: If you want to have several version of tesseract (e.g. you want to compare OCR result) I would suggest you to compile them from source (e.g. in /usr/src) and not installed them. If you want to test particular version you can run

Re: [tesseract-ocr] Not getting accuracy with Arabic font

2014-08-06 Thread Nick White

Hi Prashant, On Wed, Aug 06, 2014 at 01:32:54AM -0700, Prashant Mahskey wrote: I am using tesseract for my android app with arabic language. I've copied all the files required from the language files download page. I've tried with gray scaling and cropping extra blank part from the

Re: [tesseract-ocr] OCR using C

2014-08-06 Thread Nick White

Hi Rara, On Thu, Jul 31, 2014 at 08:29:51AM -0700, Rara wrote: I'm searching for a detailed guide to development with Tesseract and a tutorial explaining how to use and test this platform on Windows. Looking forward to your answer! There is an example program using the C API here:

Re: [tesseract-ocr] Re: Missing detailed documentation about Unicharset files

2014-08-06 Thread Nick White

Hi Albrecht, Sorry for not replying sooner, I've been away. Nevertheless I read a post from Ray where he says that he receives millions of emails and the last thing he likes to do is writing long texts (email responses or documentations). I think this is a fatal situation, because if he

Re: [tesseract-ocr] Failed to get the text

2014-08-06 Thread Nick White

Hi Fajar, Looks like you should try binarising the image yourself prior to handing it over to Tesseract. Nick -- You received this message because you are subscribed to the Google Groups tesseract-ocr group. To unsubscribe from this group and stop receiving emails from it, send an email to

Re: [tesseract-ocr] Re: Get Tesseract ocr to ignore or replace images with whitespace

2014-08-06 Thread Nick White

Hi Richard, On Sun, Jul 20, 2014 at 01:51:32PM -0700, Richard Arnold wrote: Stroke Width Transform looks very interesting. However, I have some questions regarding its use in what I'm doing. I'm writing a Desktop application and OpenOCR appears to use a web service call?? Stroke Width

Re: [tesseract-ocr] Re: Missing detailed documentation about Unicharset files

2014-08-06 Thread Nick White

On Wed, Aug 06, 2014 at 08:50:27PM +0530, Shree Devi Kumar wrote: My current plan for documentation is as follows: - Rewrite and simplify TrainingTesseract3 on the wiki - Write manpages for each tool in training/ - Document how each training file is used, and

Re: [tesseract-ocr] what does width= right -left = no silly +1/-1 mean in this tutorial?

2014-07-17 Thread Nick White

On Wed, Jul 16, 2014 at 11:17:00PM -0700, Jing JC wrote: I am going through Ray Smith's tutorial, and don't get it? He means that as the co-ordinate system uses bottom left as the origin, you will never get a minus number co-ordinate (as you could if the origin was elsewhere).
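
To make the "width = right - left" arithmetic concrete, here is a small sketch that reads one box-file style line and computes the size of the box. It assumes the Tesseract 3 box layout of `symbol left bottom right top page` with a bottom-left origin; the helper name `box_size` and the sample line are made up for illustration.

```python
def box_size(box_line):
    """Return (symbol, width, height) for one box-file style line.

    Assumes the layout 'symbol left bottom right top page' with the
    origin at the bottom left, so both differences are non-negative
    for a well-formed box.
    """
    symbol, left, bottom, right, top, _page = box_line.split()
    width = int(right) - int(left)    # no +1/-1 adjustment
    height = int(top) - int(bottom)
    return symbol, width, height


print(box_size("T 10 20 34 52 0"))
```

With a bottom-left origin, `top` is the larger y value, so the height comes out positive without any sign juggling.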

Re: [tesseract-ocr] JTessbox Modifying the boxes

2014-07-17 Thread Nick White

On Thu, Jul 17, 2014 at 12:14:43AM -0700, Jing JC wrote: The Ray's tutorial said the bounding box overlaps. so when I modify the box inside JTessbox, do I keep the overlapping boxes, or make the boxes non touching. That's interesting, actually; I didn't realise Tesseract did outlining

Re: [tesseract-ocr] questions when reading unicharset manual: https://tesseract-ocr.googlecode.com/svn-history/r683/trunk/doc/unicharset.5.html

2014-07-15 Thread Nick White

Hi, The part you aren't reading closely enough from the manual page is: properties An integer mask of character properties, one per bit. From least to most significant bit, these are: isalpha, islower, isupper, isdigit, ispunctuation. So ; has ispunctuation set, but none of the
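
The bit layout quoted from the manual page can be sketched as a small decoder. The bit order is taken from the manpage text above; the function name `decode_properties` is hypothetical, and the idea that the field is written in hexadecimal in unicharset files is my reading of the sample lines, so treat it as an assumption.

```python
# Least significant bit first, as listed in the unicharset manpage.
PROPERTY_BITS = ["isalpha", "islower", "isupper", "isdigit", "ispunctuation"]


def decode_properties(mask):
    """Return the names of the properties set in an integer mask."""
    return [
        name
        for bit, name in enumerate(PROPERTY_BITS)
        if mask & (1 << bit)
    ]


print(decode_properties(0x10))  # a mask with only the fifth bit set
print(decode_properties(0x3))   # isalpha and islower together
```

So a character like `;` with only ispunctuation set would carry a mask of 16 (0x10), while a lowercase letter would carry 3.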

Re: [tesseract-ocr] Re: Missing detailed documentation about Unicharset files

2014-07-15 Thread Nick White

Hi Albrecht, On Mon, Jul 14, 2014 at 01:10:07PM -0700, Albrecht Hilker wrote: When I download the traineddata files and extract the unicharset file from them I notice that some are extremely different from the ones on SVN in the folder training/langdata. For example: Bengali, Hebrew,

Re: [tesseract-ocr] Missing detailed documentation about Unicharset files

2014-07-15 Thread Nick White

Hi again, On Mon, Jul 14, 2014 at 09:38:26AM -0700, Albrecht Hilker wrote: After some days I came back here and I'm very surprised about your lots of posts. Thanks for answering and taking the time. As you may have noticed, there aren't too many people around here who are comfortable

Re: [tesseract-ocr] Re: Missing detailed documentation about Unicharset files

2014-07-15 Thread Nick White

Sorry for the noise. I've looked into this more, and discovered more :) On Tue, Jul 15, 2014 at 10:54:06AM -0400, Nick White wrote: On Mon, Jul 14, 2014 at 01:10:07PM -0700, Albrecht Hilker wrote: When I download the traineddata files and extract the unicharset file from them I notice

Re: [tesseract-ocr] How to find the font properties

2014-07-15 Thread Nick White

Hi Mustak, On Tue, Jul 15, 2014 at 03:14:35AM -0700, Mustak M wrote: I am new to tesseract. I am using tesseract 3.2. I am able to retrieve the text from an image. And able to get the co-ordinates for each word with tesseract source.jpg output hocr command. Is there any command to retrieve

Re: [tesseract-ocr] questions when reading unicharset manual: https://tesseract-ocr.googlecode.com/svn-history/r683/trunk/doc/unicharset.5.html

2014-07-15 Thread Nick White

Hi, On Tue, Jul 15, 2014 at 10:04:24AM -0700, Jing JC wrote: yep yep. Thanks a lot Nick. I tried to cancel my post last night. but seems I can not get access to it after posted but before approved. I tried to match the V2's example to V3's format. I figured it out later. No

Re: [tesseract-ocr] Re: Regarding Tesseract OCR engine for recognizing Tamil Fonts

2014-07-15 Thread Nick White

On Mon, Jul 14, 2014 at 11:36:46AM -0700, Paul wrote: Am Montag, 14. Juli 2014 10:07:59 UTC+2 schrieb sibi kanagaraj: But , I feel that Tamil Training is not sufficient and it could be streamlined . Hence I went to see if there are sufficient training documents for Tamil .

Re: [tesseract-ocr] Re: is tesseract 3.03's source tar available? need to compile on CentOS 5.6

2014-07-14 Thread Nick White

On Mon, Jul 14, 2014 at 07:38:19AM -0700, Christopher Smeenk wrote: I found the source for v3.03 here: http://packages.ubuntu.com/trusty/tesseract-ocr The version called 3.03 in Ubuntu is an -rc - there is no official 3.03 release yet. As I understand it Ray and Jeff called it 3.03 so that

Re: [tesseract-ocr] builing the svn source code in windows is too difficult.

2014-07-13 Thread Nick White

I build the tesseract svn source code in win8, I used the VS2013/Cygwin/MinGW to build this, all failed. Hi, you need to give us more clues as to why it failed. What error messages did you get? what version of leptonica the newest svn use? 1.70 or 1.71? Tesseract should work fine with

Re: [tesseract-ocr] is tesseract 3.03's source tar available? need to compile on CentOS 5.6

2014-07-13 Thread Nick White

On Sun, Jul 13, 2014 at 06:38:11PM +0430, universal reseller wrote: is google drive use tesseract 3.03 ? It's -rc1, meaning release candidate 1. So it isn't an official release, but rather a testing preview release, which should be close to what the final 3.03 will be. i checked one english pdf

Re: [tesseract-ocr] Re: need help removing garbage characters from my OCR

2014-07-12 Thread Nick White

On Fri, Jul 11, 2014 at 03:06:29PM -0700, Alex Ryan wrote: I wrote some simple code to preprocess the image because I realized I will be doing basically the same image every time so it's foolish to try and use Tesseract's binarization technique which was designed for a different and more

Re: [tesseract-ocr] need help removing garbage characters from my OCR

2014-07-10 Thread Nick White

Hi Alex, One quick thought, if you're still using .uzn, it's only loaded with certain psm levels (it is with -psm 6, but not -psm 3, the default). And it's loaded from imagename_without_extension.uzn. So if you have any .uzn files lying around, they will be being applied with psm 6, but not
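
As a quick way to check whether a stray .uzn file might be picked up for a given image, the naming rule described above can be sketched like this. The helper name `uzn_path` is hypothetical, and stripping only the final extension is an assumption about how "imagename_without_extension" is derived.

```python
import os


def uzn_path(image_path):
    """Return the .uzn path that would sit alongside the given image.

    Assumes the name is formed by dropping the final extension of the
    image file name and appending '.uzn'.
    """
    base, _extension = os.path.splitext(image_path)
    return base + ".uzn"


print(uzn_path("scans/page1.tif"))
print(os.path.exists(uzn_path("scans/page1.tif")))
```

If the second line prints True for an image that is giving odd results with -psm 6, the leftover .uzn file is a likely suspect.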

Re: [tesseract-ocr] Any way to prevent contextual digits-letters flipping ?

2014-07-10 Thread Nick White

Hi, I haven't tried it, but quickly grepping around the source code suggests setting the config variable crunch_include_numerals to true might do the job. Please let us know if that works. Nick On Wed, Jul 09, 2014 at 11:15:10PM -0700, Damien D wrote: Hi everyone, tesseract seems to

Re: [tesseract-ocr] Re: Missing detailed documentation about Unicharset files

2014-07-10 Thread Nick White

I'm just going to go through your numbered points here. On Fri, Jul 04, 2014 at 10:02:43AM -0700, Albrecht Hilker wrote: 1.) The column other_case should contain the ID of the other-case letter. For the lowercase letters they point correctly to the uppercase letters. But the uppercase letters

Re: [tesseract-ocr] Re: Missing detailed documentation about Unicharset files

2014-07-10 Thread Nick White

On Sat, Jul 05, 2014 at 03:34:05PM -0700, Albrecht Hilker wrote: Hello zdenop It is clear that you are not the right person to answer this question. If YOU would ever have looked into the source code you have seen that these values ARE in use (in version 3.03). You're being pretty unfair on

Re: [tesseract-ocr] Missing detailed documentation about Unicharset files

2014-07-10 Thread Nick White

I have more thoughts to the unicharset metrics discussion. So this example says that the character 1 has a min_bottom value of 59 and the character 9 has a min_bottom value of 18. Weird ? ? ? Both numbers are aligned to the baseline! I am guessing now (I'll take a look at the code later),

Re: [tesseract-ocr] need help removing garbage characters from my OCR

2014-07-10 Thread Nick White

On Tue, Jul 08, 2014 at 10:36:50PM -0700, Alex Ryan wrote: In one of the links though I saw something about the -psm setting. When I run the OCR with -psm 6 all of a sudden it worked perfectly!!! I'm really not sure what that setting does, I've tried doing some searches, but I'm still unclear. Can you

Re: [tesseract-ocr] Missing detailed documentation about Unicharset files

2014-07-10 Thread Nick White

V £ 4 9 Q A P ¢ ] 3 2 © 8 / X é j ; 7 € O ¥ U x } E § = ! ’ G ) Z q { “ — Y K * W \ ° ﬁ ‘ _ ﬂ /* * Copyright 2014 Nick White nick.wh...@durham.ac.uk * * Licensed under the Apache License, Version 2.0 (the License); * you may not use this file except in compliance with the License. * http

Re: [tesseract-ocr] need help removing garbage characters from my OCR

2014-07-09 Thread Nick White

On Wed, Jul 09, 2014 at 03:16:08AM -0700, Paul wrote: How about using ImageJ (can be automated with macros) to create a better binary result of the image. Thanks for mentioning this; I hadn't heard of it and it sounds very useful. I added a link to the ImproveQuality wiki page. Nick

Re: [tesseract-ocr] Is there any influence of the input format of the image PNG vs TIFF

2014-07-09 Thread Nick White

On Wed, Jul 09, 2014 at 09:48:20AM -0700, Rani Yaroshinski wrote: From the point of view of the performance measures of the OCR ? I don't think anybody has figures on this. You could do some tests yourself, and let us know the results. I would guess that file size would be a bigger slowdown

Re: [tesseract-ocr] Is it wise to interfere with the pre-processing pipeline of Leptonica

2014-07-09 Thread Nick White

On Wed, Jul 09, 2014 at 09:50:01AM -0700, Rani Yaroshinski wrote: In order to improve the accuracy of the OCR results ? Yes, it is, if you know more details about the images you'll be using, so can do better than Tesseract's guesses. See

Re: [tesseract-ocr] Re: Font Limit = 64 fonts in traineddata, really ??

2014-07-09 Thread Nick White

On Tue, Jul 08, 2014 at 10:49:49PM -0700, shree wrote: My information IS dated - I haven't followed the recent changes. Please see this thread - almost a year old which talked of the upcoming changes for training

Re: [tesseract-ocr] Missing detailed documentation about Unicharset files

2014-07-09 Thread Nick White

Hi Albrecht, On Thu, Jul 03, 2014 at 09:40:51PM -0700, Albrecht Hilker wrote: Generally it is very sad that there is no detailed documentation about Tesseract. I agree. I do work on the documentation, but there is an awful lot missing. I appreciate you taking the time to ask questions here

Re: [tesseract-ocr] need help removing garbage characters from my OCR

2014-07-08 Thread Nick White

Hi Alex, If you're up for some programming, you could recognise the squares yourself, and pass each one separately to tesseract with the PSM_SINGLE_CHAR segmentation type. That should help if Tesseract is not segmenting each whole square separately. If the board is always the same size, you
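
The "recognise the squares yourself" idea could start from something like the sketch below, which splits a board of known pixel size into equal cells. The function name `grid_cells` and the example dimensions are assumptions for illustration; it presumes the board fills the image and the cells are evenly spaced.

```python
def grid_cells(width, height, rows, cols):
    """Return (left, top, cell_width, cell_height) for each board square.

    Cells are listed row by row from the top left. Any remainder pixels
    from the integer division are ignored at the right and bottom edges.
    """
    cell_w = width // cols
    cell_h = height // rows
    return [
        (col * cell_w, row * cell_h, cell_w, cell_h)
        for row in range(rows)
        for col in range(cols)
    ]


for cell in grid_cells(300, 200, rows=2, cols=3):
    print(cell)
```

Each tuple could then be cropped out, or handed to the API as a recognition rectangle, and recognised on its own with the PSM_SINGLE_CHAR segmentation mode mentioned above.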

Re: [tesseract-ocr] retrieve words not matching the dictionary

2014-07-04 Thread Nick White

On Fri, Jul 04, 2014 at 02:08:46AM -0700, Meenal Goyal wrote: If you're sure that all the words you will encounter will be in the dictionary this should help somewhat: https://code.google.com/p/tesseract-ocr/wiki/FAQ#How_to_increase_the_trust_in/strength_of_the_dictionary?

Re: [tesseract-ocr] New language traineddata based on the existing one.

2014-07-04 Thread Nick White

On Fri, Jul 04, 2014 at 02:15:52AM -0700, Iskander Sharipov wrote: I need to create new tessdata language, which is very similar to russian in charset. Every time I try to do so by training tesseract on a box containing needed letters I get new traineddata, which actually can recognize new

Re: [tesseract-ocr] retrieve words not matching the dictionary

2014-07-03 Thread Nick White

On Wed, Jul 02, 2014 at 10:26:16PM -0700, Meenal Goyal wrote: The post about question about training tesseract only suggests some pre-processing steps which include binarisation and I have already tried them. I wanted to know if anything can be done to improve output at later stage,

Re: [tesseract-ocr] How to download the Tesseract trained data for Digital display numbers ( Seven Segments Data trained data )

2014-07-03 Thread Nick White

Hi Artur, On Wed, Jul 02, 2014 at 10:18:55PM -0300, Artur Augusto wrote: As many people ask about how to use tesseract to read 7 segments display, I decided to publish an open source sample project. If someone wanna check it: https://github.com/arturahttps://github.com/

Re: [tesseract-ocr] Terrible results from Tesseract API

2014-07-03 Thread Nick White

Hi Elena, Just a guess, but maybe this line: api->SetSourceResolution(600); is the source of your troubles? Tesseract from the command line would have just been guessing it, and perhaps its guess, coupled with its ideas about different sizes of fonts, were better than yours? Nick

Re: [tesseract-ocr] retrieve words not matching the dictionary

2014-07-02 Thread Nick White

. , their’ are some words which may not be considered fully as noise but they get filtered out after regex matching. Also, Is there any way to retrain tesseract for improving results in such cases? Any feedback mechanism which can help improve? On Tuesday, July 1, 2014 8:52:35 PM UTC+5:30, Nick

Re: [tesseract-ocr] Tesseract-OCR

2014-07-01 Thread Nick White

On Mon, Jun 30, 2014 at 10:42:41PM -0700, nirali kanani wrote: is there Tesseract - ocr v 3.03 exe available anywhere ? Tesseract v3.03 hasn't been released yet (except as a pre-release version in the latest ubuntu). The code is unlikely to change a lot from what's currently in SVN, so you

Re: [tesseract-ocr] How to use the API in linux system

2014-07-01 Thread Nick White

Hi, On Mon, Jun 30, 2014 at 09:25:23PM -0700, 韩煦深 wrote: I'm a Chinese student and I want to use the tesseract-ocr in our linux system. I have Ubuntu OS and I install tesseract in my ubuntu system. But I don't know how to use C++ API in linux system because all the examples are based on VC++

Re: [tesseract-ocr] retrieve words not matching the dictionary

2014-07-01 Thread Nick White

Hi Meena, On Tue, Jul 01, 2014 at 02:04:36AM -0700, Meenal Goyal wrote: When I try to ocr an image, it also produces some noise apart from the meaningful words. An example output for an image is: All women become like their’ mqthers. _ ' 1’ ' - —T at-{rs their tragedy. ” R-‘»“T‘*'-.

Re: [tesseract-ocr] retrieve words not matching the dictionary

2014-06-30 Thread Nick White

Hi Meenal, On Mon, Jun 30, 2014 at 01:40:10AM -0700, Meenal Goyal wrote: When i run tesseract on my image, it produces some words not present in the dictionary. Is there some way to directly get the list of these words and prevent tesseract from showing them in the output. Example of such

Re: [tesseract-ocr] Advice needed on effective hexadecimal recognition

2014-06-30 Thread Nick White

Hi Scott, On Fri, Jun 27, 2014 at 09:39:21PM -0700, scott.ha...@gmail.com wrote: Hi all. Firstly let me say I am totally blown away by Tesseract, it vastly exceeded my expectations for an open source OCR project. I have an application (http://hackaday.io/project/1569-NSA-Away) that

Re: [tesseract-ocr] read multi-language ( arabic and english) image

2014-06-27 Thread Nick White

On Fri, Jun 27, 2014 at 01:48:52AM -0700, thinker wrote: reading image with multiple language (arabic and english) by using -l ara+eng option gives garbage output. There are currently a couple of bugs with combining Arabic and English together, so it isn't working. I'd recommend you add any

Re: [tesseract-ocr] question on training tesseract for arbitrary big images

2014-06-27 Thread Nick White

Hi Mori, On Fri, Jun 27, 2014 at 01:51:01AM -0700, morteza neishaboori wrote: I want to use OCR to detect small words in images containing indoor signs and etc you can find some sample images in the link below to get the idea

Re: [tesseract-ocr] Support for Sinhala

2014-06-27 Thread Nick White

Hi Sheeyam, sorry for not replying to your emails sooner. On Sun, Jun 22, 2014 at 04:43:27AM -0700, sheeyam shellvacumar wrote: Does Tesseract support sinhala. How do u guys train them ??? Actually i am confused help me It looks like some people have trained Tesseract for Sinhala; see

Re: [tesseract-ocr] Can tesseract read cursive handwriting?

2014-06-27 Thread Nick White

Hi Paulo, On Mon, Jun 23, 2014 at 10:11:28AM -0700, Paulo Basilio wrote: Good day, I am trying to develop a mobile app that can read cursive handwriting (doctor's handwriting to be exact). My question is, can tesseract-ocr read cursive handwriting? If not, can someone give me suggestion for

Re: [tesseract-ocr] 'BLOCK_LINE_IT' was not declared in this scope

2014-06-27 Thread Nick White

Hi Raghavan, On Tue, Jun 24, 2014 at 06:58:56AM -0700, Raghavan P wrote: When i try to make use of tesseract classes like BLOCK_IT and BLOCK_LINE_IT, I am getting the error it was not declared in this scope. May i know what header should i bring in or what am i missing here? Are you using the

Re: [tesseract-ocr] Can tesseract read cursive handwriting?

2014-06-27 Thread Nick White

On Fri, Jun 27, 2014 at 04:57:30PM -0400, Nick White wrote: On Mon, Jun 23, 2014 at 10:11:28AM -0700, Paulo Basilio wrote: Good day, I am trying to develop a mobile app that can read cursive handwriting (doctor's handwriting to be exact). My question is, can tesseract-ocr read cursive

Re: [tesseract-ocr] general tesseract help for coding newbie

2014-06-26 Thread Nick White

Hi Jack, I replied privately, but the gist is that VietOCR is a graphical program that makes Tesseract easier to use on a Mac (as well as Linux and Windows). Nick On Thu, Jun 26, 2014 at 08:55:19AM -0700, Jack Kershaw wrote: I am an ancient greek student currently studying A levels. I have been

Re: [tesseract-ocr] Any suggestions on pre-processing to improve accuracy?

2014-06-26 Thread Nick White

On Mon, Jun 23, 2014 at 08:32:52AM -0700, Traun Leyden wrote: One more thing that document should have is a mention of Stroke Width Transform to improve OCR recognition on images that have a lot of non-text content. Oh cool, that looks great! I definitely will add that to the wiki page

Re: [tesseract-ocr] Tesseract with PHP wrapper input stream not found

2014-06-25 Thread Nick White

Hi Eddie, I'd suggest contacting the author of the PHP wrapper, that isn't something provided by the core Tesseract project, and it doesn't look like any issue with Tesseract proper, just with the caller. Nick On Wed, Jun 25, 2014 at 12:36:59AM -0700, Eddie G wrote: I'm using the PHP

Re: [tesseract-ocr] Re: hocr2pdf

2014-06-23 Thread Nick White

Hi Amar, If you can wait for the release of Tesseract 3.03 (or compile the latest version from SVN), that has PDF output built in. Nick On Mon, Jun 23, 2014 at 12:19:52AM -0700, Amar wrote: Hello dear friends, Is HOCR2PDF command line tool limited only to non-windows platforms? I could not

Re: [tesseract-ocr] Any suggestions on pre-processing to improve accuracy?

2014-06-20 Thread Nick White

Hi Traun, Any tips on doing pre-processing on the images to improve the recognition? The place to start would be here: https://code.google.com/p/tesseract-ocr/wiki/ImproveQuality Nick

Re: [tesseract-ocr] limiting to 8 letters and numbers only (LPR)

2014-06-19 Thread Nick White

Hi Ketut, On Tue, Jun 10, 2014 at 11:30:39PM -0700, ketut ariasa wrote: I have a very limited OCR application using tesseract, where I want to recognize only 8 letters and numbers begin with the letter 'D'. Is there a way to restrict tesseract to attempt to recognize only 8 digits letters
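One way to approach a fixed format like this, independent of any Tesseract setting, is to filter the recognised text afterwards. The following is only a minimal sketch: the exact pattern ('D' followed by seven uppercase letters or digits) and the helper name `extract_plate` are assumptions made for illustration, not something from this thread.

```python
import re

# Assumed format: the letter 'D' followed by seven uppercase letters or digits.
PLATE_PATTERN = re.compile(r"D[A-Z0-9]{7}")


def extract_plate(ocr_text):
    """Return the first 8-character candidate starting with 'D', or None.

    Hypothetical helper: it post-processes text that has already been
    recognised; it does not change how Tesseract itself behaves.
    """
    # Remove everything that is not a letter or digit, then uppercase,
    # so stray spaces or punctuation do not break up the candidate.
    cleaned = re.sub(r"[^A-Za-z0-9]", "", ocr_text).upper()
    match = PLATE_PATTERN.search(cleaned)
    return match.group(0) if match else None
```

Because all whitespace is removed before matching, this only makes sense when the image is already cropped to the single field of interest.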

Re: [tesseract-ocr] Re: Pharmaceutics OCR recognition project

2014-06-19 Thread Nick White

On Wed, Jun 18, 2014 at 07:30:03AM -0700, Paul wrote: That upper bound actually might be the root of your problem. If you've already compiled Tesseract on your own, try to use a greater number for kMaxUserDawgEdges. If you have not, you could either reduce the number of words in your

Re: [tesseract-ocr] Tesseract 3 doesn't recognize portion of the image with one word inside

2014-06-05 Thread Nick White

On Thu, Jun 05, 2014 at 01:51:24PM +0200, zdenko podobny wrote: On Thu, Jun 5, 2014 at 12:10 PM, 'thakobyan' via tesseract-ocr tesseract-ocr@googlegroups.com wrote: Trying to OCR the portion of the image. For some reason if I cut only one word (see Fail.png and Fail2.png attached)

[tesseract-ocr] Example C-API program outputing UZN zone files

2014-06-05 Thread Nick White

/* Copyright 2014 Nick White * * Licensed under the Apache License, Version 2.0 (the License); * you

Re: [tesseract-ocr] Is any one working on a deep neural net implementation?

2014-06-04 Thread Nick White

Hi Debayan, On Wed, Jun 04, 2014 at 01:53:54PM +0530, Debayan Banerjee wrote: I am contemplating porting the classifier to a deep neural net, probably https://github.com/BVLC/caffe. Anyone already working on this? This should allow Tesseract to recognise some of the more complicated

Re: [tesseract-ocr] Re: Anyone working on Georgian (kartuli ena)?

2014-06-03 Thread Nick White

work at Georgian. Will revert later and share my little experience. On Wednesday, 28 May 2014 at 19:26:55 UTC+4, Nick White wrote: Hi all, Resurrecting an old thread: has anyone got anywhere training tesseract for Georgian? Or tried

Re: [tesseract-ocr] Re: Using tesseract for read ZONES

2014-05-29 Thread Nick White

Hi Krijesh, On Thu, May 29, 2014 at 03:09:21AM -0700, Krijesh PV wrote: But as you said the command 1. tesseract 8531_001.3B.tif 8531_001.3B_uzn -psm 4 is generating a 8531_001.3B_uzn.txt file. I am not able to getting uzn file. I need to generate a template for my image contents.

Re: [tesseract-ocr] Re: Using tesseract for read ZONES

2014-05-29 Thread Nick White

Hi Krijesh, On Thu, May 29, 2014 at 07:38:17AM -0700, Krijesh PV wrote: i am completely a novice on this topics, can please explain on complete process, how can i create this uzn files are there any tools for that, There aren't any tools to create uzn files, that I know of. You can see how
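Since there is no known tool for this, a uzn file can be written by hand or from a small script. The sketch below assumes the commonly described layout of one zone per line, given as left, top, width, height and a free-text label separated by spaces; that layout, and the helper name `format_uzn`, are assumptions here and should be checked against the Tesseract version in use.

```python
def format_uzn(zones):
    """Build the text of a uzn file from a list of zones.

    Each zone is a tuple (left, top, width, height, label), in pixels.
    The one-zone-per-line layout is an assumption, not confirmed in
    this thread.
    """
    lines = []
    for left, top, width, height, label in zones:
        lines.append(f"{left} {top} {width} {height} {label}")
    # End with a newline so the last zone is a complete line.
    return "\n".join(lines) + "\n"
```

For example, `format_uzn([(10, 20, 300, 40, "Text")])` produces a single line, which could then be saved next to the image under the same base name with a `.uzn` extension.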

Re: [tesseract-ocr] Cannot open input file

2014-05-28 Thread Nick White

On Wed, May 28, 2014 at 06:09:00AM -0700, Lutz Wittenmayer wrote: I made a copy of the eurotext.tif and inserted it into the directory where the tesseract.exe is located. Same error message Finally I also placed this tif into the directory tessdata. I got always the meassage Cannot open

Re: [tesseract-ocr] Using tesseract to read numbers

2014-05-27 Thread Nick White

Hi Bernardo, On Mon, May 26, 2014 at 03:58:22PM -0700, Bernardo Meurer wrote: I'm in need of some help, I was wondering if it would be possible to use tesseract to read number plates as the one in the image below. If that is doable, if anyone could give me some directions of where to start it

Re: [tesseract-ocr] Using tesseract to read numbers

2014-05-27 Thread Nick White

Hi Bernardo, On Tue, May 27, 2014 at 01:36:58PM -0700, Bernardo Meurer wrote: Now, I found this sample code which I am trying to test on my plates to see if its successful. I am getting compiling bugs when i try to run it, but i'm not sure this is the place to ask for help in such way. If its

Re: [tesseract-ocr] text2image infinite ScrollView: Waiting for server

2014-05-25 Thread Nick White

Hi Michael. On Sat, May 24, 2014 at 10:38:57PM -0700, Michael Yang wrote: I'm able to compile the text2image training tool, however, I can't seem to get it to work. I've confirmed that the viewer works with the included tesseract tests. I've included the output below. Any help is much

Re: [tesseract-ocr] Tesseract API - hOCR output doesn't match what I get using console

2014-05-24 Thread Nick White

Hi Przemysław, On Sat, May 24, 2014 at 04:11:32AM -0700, Przemysław Woźniak wrote: The problem which I encountered is that hOCR output that I produce using C++ code isn't the same as what I get using tesseract.exe from Windows console. I'm speaking of course about the accuracy of words
