Re: Errors during compilation of Tesseract for windows mobile platform

2008-10-24 Thread Ray Smith
Most of these errors look harmless. You could try adding the compiler define EMBEDDED, or alternatively, delete all the code that refers to sigmenu, as it is not used any more. That gets rid of most of them. I am not sure off-hand whether the remaining errors can be removed so easily, but it is

Re: Training by providing a text file accompanying an image?

2008-11-28 Thread Ray Smith
It is possible, and there are broken bits of code that support that kind of training, but it hasn't been used for years and no longer works, so it would take quite a lot of effort to get it working. Ray. On Thu, Nov 20, 2008 at 4:29 AM, Philipp Lenssen [EMAIL PROTECTED] wrote: Hi! I read through

Re: Why are all 'e'-s 'c'-s here?

2008-12-02 Thread Ray Smith
You can upload files to groups. It would help to diagnose your problem. BTW are you using English, or your own training data? Ray. Sent from my G1 Android Phone. On Dec 2, 2008 12:58 AM, udippel [EMAIL PROTECTED] wrote: For me, Tesseract does a good job. The recognition rate is comparatively

Re: Mathematical Formulae recognition

2008-12-11 Thread Ray Smith
This problem has not been attempted before with tesseract. The biggest thing to watch out for is to skip the text line and word finding. You might have significant success just running the classifier on the connected components. Training might be a bit tricky too, since it relies on the text line

Re: Mathematical Formulae recognition

2008-12-16 Thread Ray Smith
to the previous line or an extra line in between. I've also observed that sometimes, the same symbol can be recognized easily when it occurs in a subscript position, but is often mistaken when it occurs in a superscript position. lab. On Dec 12, 8:51 am, Ray Smith theraysm...@gmail.com wrote

Re: Tesseract newbie - No output from tesseract

2008-12-20 Thread Ray Smith
See http://code.google.com/p/tesseract-ocr/issues/detail?id=160 Ray. On Fri, Dec 19, 2008 at 5:37 PM, lab la...@lbreyer.com wrote: In my experience, TIFF files sometimes have an alpha layer. The easiest way to ensure a usable image for tesseract is to do these two steps (on Debian) convert

Re: Tesseract newbie - No output from tesseract

2008-12-23 Thread Ray Smith
http://code.google.com/p/tesseract-ocr/issues/detail?id=160 On Mon, Dec 22, 2008 at 11:35 PM, ABB abbhoos...@gmail.com wrote: Link not found :-( On Dec 21, 12:02 am, Ray Smith theraysm...@gmail.com wrote: See http://code.google.com/p/tesseract-ocr/issues/detail?id=160 Ray. On Fri

Re: MinGW build issues.

2008-12-23 Thread Ray Smith
Hi Brucey, It would be helpful if you could post a patch, either on the issues list or here. I haven't yet completed my scan of the issues list, but I am trying to incorporate all the portability patches into 2.04 before we go up to 3.0 next year. I have just uploaded some of them to svn. I know

Re: Is it possible to use tesseract in a C program?

2008-12-30 Thread Ray Smith
As with any C to C++ interface, you just have to write a C layer on top of the C++ interface. It isn't very difficult, just ugly. The existing dll api already provides such a C layer. Ray. On Tue, Dec 30, 2008 at 4:11 AM, HK herve75...@yahoo.fr wrote: In tesseract package there is an example of

Re: Multicore?

2009-01-08 Thread Ray Smith
Tesseract uses very little memory. The best way of using multiple cores is simply to have multiple processes running simultaneously, processing different pages. If you want to get more sophisticated than that, you will have to wait for the completion of the thread-safety project, as
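The one-process-per-page approach can be sketched in Python as below. This is only an illustration, not part of Tesseract: the helper names `build_command` and `ocr_pages`, the default of 4 workers, and the `eng` language code are assumptions, and the command line follows the usual `tesseract imagename outputbase -l lang` form.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_command(image_path, lang="eng"):
    """Return the argv for one single-page tesseract run."""
    # tesseract appends ".txt" to the output base itself.
    output_base, _ = os.path.splitext(image_path)
    return ["tesseract", image_path, output_base, "-l", lang]


def ocr_pages(image_paths, workers=4, runner=subprocess.run):
    """Run one tesseract process per page, at most `workers` at a time.

    `runner` is called with each argv; it defaults to subprocess.run and
    can be swapped for a stub when no tesseract binary is installed.
    """
    commands = [build_command(path) for path in image_paths]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # The threads only wait on child processes; the OCR itself runs in
        # separate tesseract processes, one per page.
        return list(pool.map(runner, commands))
```

Calling `ocr_pages(["page1.tif", "page2.tif"])` would start up to four tesseract processes at once and return their results in page order.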

Re: How to get decent results?

2009-01-08 Thread Ray Smith
The output is indeed utf-8. Ray. On Thu, Jan 8, 2009 at 11:00 AM, Michael Moore stuporg...@gmail.com wrote: On Thu, Jan 8, 2009 at 11:30 AM, Darren Govoni dar...@ontrenet.com wrote: Hey Michael, I really appreciate the tips. I'm developing an automated batch ocr'ing system and there

Re: recognize just one line

2009-01-12 Thread Ray Smith
There will be explicit support for single line mode in 3.00, mostly for the benefit of ocropus. Ray. On Sat, Jan 10, 2009 at 11:41 PM, federico.boschetti federico.boschetti...@gmail.com wrote: I'm using tesseract in conjunction with ocropus to recognize ancient Greek. Ocropus makes the

Re: Official Windows build of 2.03

2009-01-12 Thread Ray Smith
You need additional dlls and there is too much disagreement in the windows user community over which version of the developer platforms to use. Making windows executables is fraught with problems, which is why I am looking for someone to write a windows installer... Ray. On Mon, Jan 12, 2009 at

Re: Recognizing time and date from a picture

2009-01-16 Thread Ray Smith
Also you might need to scale up. See the FAQ. Ray. On Fri, Jan 16, 2009 at 12:41 PM, spackmann richardspackm...@greenfieldfd.org wrote: Try using ImageMagick and cropping the image down to just the white bottom border which contains the text you want to OCR.

Re: tessdll.dll failed under windows XP

2009-01-21 Thread Ray Smith
Unfortunately the TessBaseAPI isn't included in the tessdll, as the dll is a separate api. It can be fixed by adding the appropriately defined DLLSYM to the class definition. You also have to include the appropriate definition of DLLSYM so that it is defined to be DLLEXPORT when building the DLL,

Re: Installazione sotto Windows

2009-01-26 Thread Ray Smith
See this page for compiling on Windows: http://code.google.com/p/tesseract-ocr/wiki/ReadMe Ray. On Fri, Jan 23, 2009 at 11:49 AM, Design Software marasco.marco.designsoftw...@gmail.com wrote: Hello to all the users of the Group! I wanted to ask you something. I have started using

Re: Training results versus Testing results

2009-01-26 Thread Ray Smith
Yup, it was the batch.nochop that was making the difference. Ray. On Fri, Jan 23, 2009 at 12:21 AM, Alatius johan.wi...@gmail.com wrote: Oh! I just realised that if I create the box file with tesseract boxtxtdiff.tif boxtxtdiff batch makebox (i.e. without '.nochop'), the same characters are

Re: tessdll.dll failed under windows XP

2009-01-26 Thread Ray Smith
: 'ocrclass.h': No such file or directory 1main.cpp The tessdll.dll indeed contains the TessDllAPI, but why can't the .h files above be found? On Jan 22, 12:54 AM, Ray Smith theraysm...@gmail.com wrote: Unfortunately the TessBaseAPI isn't included in the tessdll, as the dll is a separate api

Re: First Step Teeseract

2009-01-26 Thread Ray Smith
Windows isn't completely stupid. The file system accepts / so there is no need to change. Ray. On Wed, Jan 21, 2009 at 1:40 PM, Israel calhei...@gmail.com wrote: ok, i have seen the code. i need to configure the path under the system variable TESSDATA_PREFIX but the back slash is not

Re: Compile errors 2.03

2009-01-26 Thread Ray Smith
If you get the new platform.h from this location: http://code.google.com/p/tesseract-ocr/source/browse/trunk/ccutil/platform.h the problem of _vsnprintf should be solved. Ray. On Thu, Jan 15, 2009 at 8:34 AM, SteveP spohor...@sjm.com wrote: Did you see my post from Jan 13, 2009? This might be

Re: OCR in VB 2008

2009-01-26 Thread Ray Smith
Get a new platform.h from here: http://code.google.com/p/tesseract-ocr/source/browse/trunk/ccutil/platform.h to fix the vsnprintf problem. Ray On Tue, Jan 13, 2009 at 5:28 PM, SteveP spohor...@sjm.com wrote: There is a solution for the compile issues in VS.NET 2008, at least if you got the

Re: comparing Tesseract to Enterprise solutions like Abby Fine Reader

2009-01-27 Thread Ray Smith
A lot depends on your application and the type of image that you want to OCR. Tesseract still lacks page layout analysis. Its character error rate is probably about 2x worse than the best commercial engines, but that will vary according to the image quality. If your image quality and fonts are very

Re: Germanletters and Symbols in Whitelist

2009-02-06 Thread Ray Smith
You can use \u notation, e.g. \u20ac\u00a3 gives you the Euro sign and the pound sign. The compiler converts the Unicode escapes to UTF-8 strings. Not sure if old compilers like vc6 support it. You might need to use \xhh to specify UTF-8 byte codes. Ray. On Fri, Feb 6, 2009 at 7:55 AM,
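To see which \xhh byte codes correspond to the \u escapes mentioned above, a small Python sketch can print the UTF-8 encoding of the two symbols. The helper name `as_hex_escapes` is made up for this illustration; it simply formats each UTF-8 byte in the \xhh form that an older compiler would need.

```python
def as_hex_escapes(text):
    """Format the UTF-8 bytes of `text` as a run of \\xhh escapes."""
    return "".join(f"\\x{byte:02x}" for byte in text.encode("utf-8"))


# Euro sign (U+20AC) followed by pound sign (U+00A3).
whitelist = "\u20ac\u00a3"

print(whitelist.encode("utf-8"))  # the raw UTF-8 bytes
print(as_hex_escapes(whitelist))  # the same bytes as \xhh escapes
```

The Euro sign takes three bytes in UTF-8 and the pound sign two, so the \xhh form of this two-character whitelist is five escapes long.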

Re: trouble getting good results

2009-02-28 Thread Ray Smith
2.04 is likely to appear just before 3.00. Its purpose is to incorporate patches that have been provided to the group and to fix as many bugs/compilation issues as possible so there is a stable base version prior to the release of 3.00, which is likely to introduce a new pile of such issues. Ray.

Re: How to decrease Tif file size

2009-03-06 Thread Ray Smith
You might also like to check out the FAQ http://code.google.com/p/tesseract-ocr/wiki/FAQ on color images. Ray. On Fri, Mar 6, 2009 at 6:45 AM, Albert Law a...@snowbound.com wrote: ps: You haven't stated why big TIFF files cause problems. Is it a HD thing or a main memory thing? - Albert

Re: tesseract source code tutorial

2009-03-06 Thread Ray Smith
See "Where is the documentation" in the FAQ: http://code.google.com/p/tesseract-ocr/wiki/FAQ Ray. On Thu, Mar 5, 2009 at 3:33 AM, mynickmynick mynickmyn...@yahoo.com wrote: The tesseract source code is so wide that it's pretty a long journey having to read it all. Could you suggest some tutorial

Re: incomplete output

2009-03-09 Thread Ray Smith
FAQ: Minimum text size. Ray. On Mon, Mar 9, 2009 at 3:20 AM, Thomasyi thomasyi2...@yahoo.com wrote: Forgot to write down my system information: Windows XP SP1 Tesseract-OCR 2.03 with windows executables

Re: Correction of APPLY BOX during training

2009-03-09 Thread Ray Smith
Unfortunately this just trains incorrect outlines. The problem is that applybox doesn't do forced chopping of touching outlines, but it needs to. You need to render your training text with a small amount of inter-character spacing so that the samples don't touch in the first place. Ray. On Thu,

Re: .NET boxing tool with autoalign and suitable for large training sets

2009-03-12 Thread Ray Smith
I added it to the training wiki. Looks like there is a long list of comments there too... Ray. On Thu, Mar 12, 2009 at 2:40 AM, Ondra stradasi...@gmail.com wrote: Hi all, here http://www.ospilka.com/dl/tessboxer.zip is recoded tessboxer for windows which works with large files without

Re: tesseract in only C (no C++) implementation?

2009-03-12 Thread Ray Smith
Probably. It builds on Android. Ray. On Tue, Mar 10, 2009 at 2:40 AM, mynickmynick mynickmyn...@yahoo.com wrote: Thank you for help Do you guess it's feasible to port it to a linux embedded platform using buildroot and uclibc uclibc++ instead of glibc?

Re: Training Tesseract

2009-03-24 Thread Ray Smith
Yes there is a circular dependency. You get round it by using all the 8 stock files for english while you make your new set. Ray. On Tue, Mar 24, 2009 at 6:09 AM, Ray Renteria r...@robotcentral.com wrote: BTW, I'm on Windows XP and I'm running the command-line version. --Ray

Re: How much training data - if characters are always the same

2009-03-30 Thread Ray Smith
Such small amounts of text may confuse the textline/baseline/word finder. You don't need a large number of samples, but it does add some randomized noise, so multiple samples are desirable. Ray. On Thu, Mar 26, 2009 at 9:05 AM, SteveP spohor...@sjm.com wrote: What I have noticed about tesseract

Re: Is there any way to use the viewer?

2009-03-31 Thread Ray Smith
I have begun a wiki page on this subject: http://code.google.com/p/tesseract-ocr/wiki/ViewerDebugging?ts=1238525607&updated=ViewerDebugging This question has not come up much so far. The page will be completed in due course. Ray. On Tue, Mar 24, 2009 at 5:22 PM, Fuad Jamour fjam...@gmail.com wrote:

Re: How much training data - if characters are always the same

2009-03-31 Thread Ray Smith
Tesseract relies on having a significant number of characters to get decent statistics on the baseline position, x-height etc., so a few scattered characters will put it in danger of making a lot of stupid errors. It even struggles with a single line of text. BTW this is the reason the training

Re: How much training data - if characters are always the same

2009-04-01 Thread Ray Smith
? On Mar 31, 9:21 pm, Ray Smith theraysm...@gmail.com wrote: Tesseract relies on having a significant number of characters to get decent statistics on the baseline position, x-height etc., so a few scattered characters will put it in danger of making a lot of stupid errors. It even

Re: How much training data - if characters are always the same

2009-04-02 Thread Ray Smith
I have a question about how the Tesseract OCR works: what are all the steps performed to extract the text from an image? Please explain. Thank you. On 4/2/09, Ray Smith theraysm...@gmail.com wrote: If you have the advantage of working

Re: Empirical DangAmbigs generator

2009-04-09 Thread Ray Smith
Interesting result. The problem is that the value of DangAmbigs varies according to the size of the document being OCRed. Very small documents don't benefit from the adaptive classifier at all, so DangAmbigs has very little effect. Very large (eg multipage) documents benefit greatly from the

Re: how to draw image region.

2009-04-09 Thread Ray Smith
The command-line tesseract can read unlv zone files. Take a look at the code in ccstruct/blread.cpp or visit www.isri.unlv.edu for documentation. Programmatically, TesseractRect allows you to recognize an arbitrary rectangle. Ray. On Thu, Apr 9, 2009 at 5:40 AM, 74yrs old withblessi...@gmail.com

Re: Found some strange errors

2009-04-16 Thread Ray Smith
This is a known problem. It has never really been tuned for small amounts of text, but does need some work. Ray. On Apr 16, 2009 6:53 AM, George zor...@163.com wrote: We found the reason. As you said: you don't give enough samples to Tess. Thank you. When some words are not complete, Tesseract

Re: Tesseract port to win CE

2009-04-16 Thread Ray Smith
I don't think there is much that cannot be run on wince if it can be modified to run on android... It should be relatively easy. The debug code is about the only code that uses many os functions other than basic file io. Ray. On Apr 16, 2009 7:13 AM, George zor...@163.com wrote: How about it

Re: Defects of Tesseract 2.03 on Debian/Ubuntu?

2009-04-16 Thread Ray Smith
You can get to my icdar paper by searching groups for ray's paper. There is another description of the debug/control variables if you follow the documentation link on the tesseract home page. There is a fundamental problem with the tesseract features for very small text. To allow it to recognize

Re: Tesseract development, perhaps an opportunity for some

2009-04-16 Thread Ray Smith
Screen text has some unique problems, in addition to the small text problem. The anti-aliasing used to get more virtual pixels out of a screen makes it really difficult to get useful images out of the screen. In addition to this, tesseract's use of a polygonal approximation makes it difficult for

Re: What causes this error? 6 classes in inttemp while unicharset contains 7

2009-04-22 Thread Ray Smith
This is a problem with the applybox code that needs to be fixed. In 2.03/4 deleting the extra character from the unicharset is not a problem beyond the fact that it won't recognize them. In 3.00 it gets more serious, as it will be using the order in the unicharset to determine what to output for

Re: tesseract on mac os x

2009-04-27 Thread Ray Smith
You also need -Iccmain in your compiler options and #include "baseapi.h" in your code. Ray. On Apr 26, 2009 10:06 PM, sai firetr...@gmail.com wrote: I've installed the tesseract 2.01/2.03 and even the read-only edition on mac, successfully, I guess. However, after setting the linker flags,

Re: Tesseract 3.0

2009-05-01 Thread Ray Smith
Sorry, it is still in incubation. The latest news is that it will not build on VC++6. It is about time it went away. It will hopefully build on VC++express 8 (downloading it and installing now.) Ray. On Thu, Apr 30, 2009 at 5:46 PM, Rob H. hksny...@gmail.com wrote: Is Tesseract 3.0 available

Re: Tesseract options

2009-05-04 Thread Ray Smith
The best documentation for these is still here: http://tesseract-ocr.repairfaq.org/tess_variables_all.html Ray. On Tue, Apr 28, 2009 at 11:53 PM, g.getsov georgi.get...@googlemail.com wrote: Hello Does anyone have a list of options that could improve (or at least change) the performance of the

Re: Incomprehensible output

2009-05-25 Thread Ray Smith
Looks like the input image was of poor quality or otherwise damaged. Ray. On May 19, 2009 1:02 PM, collimarco collimarc...@gmail.com wrote: I have successfully installed Tesseract through MacPorts along with Italian language package. Tesseract seems to work properly, but when I open the output

Re: OCR on invoices

2009-05-28 Thread Ray Smith
This kind of variability is a bit of a problem, and it seems to occur when the image is of insufficient quality, or the font is far from the training data. At some point, we may find a solution, but for now, the best solution is to retrain on the data you want to recognize. Ray. On Thu, May 28,

Re: Using tesseract for isolated digit recognition

2009-05-28 Thread Ray Smith
With single characters, it loses the ability to find the baseline, xheight etc, so certain sets of characters will all look alike. Having said that, 3.00 will have a single character mode that enables you to at least attempt to recognize them. Ray. On Thu, May 28, 2009 at 2:55 AM, paulfeakins

Re: Error Installing in Debian lenny

2009-05-28 Thread Ray Smith
I think you need the more recent code from svn. Ray. On Wed, May 27, 2009 at 8:23 AM, Adrian adrian03...@gmail.com wrote: Hi, I was trying to install tesseract on my Debian Lenny (Intel 32 bits) and I got the ./configure ok (it tells I can run make), the lines with no, missing and the config

Re: problem by character recognition

2009-05-29 Thread Ray Smith
RTFM. See the FAQ on small text. Ray. On Tue, May 19, 2009 at 1:33 PM, denis56 denis.ergashb...@gmail.com wrote: Here is the link to three files that I mentioned (original, converted with java imageio package, and with Image Converted utility) http://www.speedyshare.com/732780799.html

Re: FATALITY ERROR during Training Tesseract for 7 Segment Display

2009-05-29 Thread Ray Smith
The problem is probably that the textline finder is splitting your characters over multiple lines. While it is not supposed to do this, it does it sometimes. A fix to applybox is needed so it can still work in this situation. Ray. On Thu, May 14, 2009 at 11:26 PM, Raj mail2sun@gmail.com wrote:

Re: Tesseract for Mobiles?

2009-06-01 Thread Ray Smith
have to dig into the tesseract source code and remove the un-supported code. :( Any hints would be greatly anticipated, thanks. Best Regards Liutao On May 30, 1:31 AM, Ray Smith theraysm...@gmail.com wrote: Yes, it needs a bit of work to properly compile it, but the EMBEDDED

Re: REQUEST: Norwegian OCR

2009-06-01 Thread Ray Smith
Should be in 3.00. Ray. 2009/4/5 Arno Teigseth arnot...@gmail.com Hi Arnstein, I have some experience with training, and have some scripts you can use that make the job a good deal easier. If you want, I can send them to you (I haven't quite figured out how to add them to the tesseract pages

Re: Issue : usage of dictionary files ( freq-dawg word-dawg ) in tesseract

2009-06-01 Thread Ray Smith
Yes. This should be resolved in 2.04. Dawg generation will be further improved in 3.00, with abolition of the fixed memory buffer, but the data files and code will not be backwards compatible. The 3.00 dawgs will be tied to a specific unicharset file. Ray. On Fri, Apr 17, 2009 at 12:31 PM, Debayan

Re: Approach for training Tesseract with a new language and/font faces

2009-06-01 Thread Ray Smith
I have added a clarification to the training wiki that might explain this better. Ray. On Mon, Apr 20, 2009 at 12:46 PM, MilanKnizek knizek.co...@volny.cz wrote: I have come recently to Tesseract, since it is used by OGMRIP for OCR of DVD subtitles. First run for the subtitles in the Czech

Re: breaking down of the glyphs

2009-06-02 Thread Ray Smith
You could preprocess the images with a morphological operation, such as dilation to make the fragments touch again, but to solve it in general is a hard problem. Ray. 2009/4/24 tt yury.tarasiev...@gmail.com When making .boxes, re-using my own training results, and with rather brightishly lit
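As an illustration of the dilation idea above, here is a minimal pure-Python sketch of binary dilation. It assumes the image is a list of rows of 0/1 pixels (1 = ink) and uses a 3x3 square structuring element; both choices are assumptions made for the example, and a real pipeline would more likely use an imaging library.

```python
def dilate(image, iterations=1):
    """Binary dilation with a 3x3 square structuring element.

    `image` is a list of equal-length rows of 0/1 values (1 = ink).
    Each pass turns on every pixel that touches an ink pixel,
    including diagonally, so small gaps between fragments close up.
    """
    height = len(image)
    width = len(image[0]) if image else 0
    current = [list(row) for row in image]
    for _ in range(iterations):
        result = [[0] * width for _ in range(height)]
        for y in range(height):
            for x in range(width):
                if not current[y][x]:
                    continue
                # Spread this ink pixel to its in-bounds 3x3 neighbourhood.
                for ny in range(max(0, y - 1), min(height, y + 2)):
                    for nx in range(max(0, x - 1), min(width, x + 2)):
                        result[ny][nx] = 1
        current = result
    return current


broken = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0],  # two fragments separated by a one-pixel gap
    [0, 0, 0, 0, 0],
]
joined = dilate(broken)
```

After one pass the one-pixel gap between the two fragments is filled. The strokes also get thicker, which is one reason this is not a general solution.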

Re: Arbitrarily rotated number with tesseract?

2009-06-06 Thread Ray Smith
It would be a start to train it with the data it has to deal with. You could make use of the fact that it happily deals with multi-char strings to train it on all 20 possible answers in several different (at least 8) orientations. Ray. On Jun 5, 2009 4:04 PM, Brian bfor...@gmail.com wrote:

Re: Unable to load unicharset file /root/Download/pytesser/tessdata/eng.unicharset

2009-06-11 Thread Ray Smith
I have made this clearer and bigger on the home page (which everybody merrily ignores anyway) and in the ReadMe wiki. Also updated the FAQ to point to the wiki page. A lot of users have had trouble understanding this. Hopefully it will be clearer now. It will be very important for 3.00, as there

Re: Required free samples to test OCR into my asp.net application

2009-06-11 Thread Ray Smith
Yes! Read the FAQ and the ReadMe wikis to find out how to add support for compressed tif. For other formats, you need 3.00, which is not ready yet. Ray. On Wed, Jun 10, 2009 at 11:16 PM, naresh naresh.kanduk...@gmail.com wrote: One more question i have regarding tesseract ocr engine. Did

Re: Please test 2.04 release candidate!

2009-06-11 Thread Ray Smith
mnjrupp, OK, so I hadn't tested with libtiff, but I just did and it works, but that was building with vc++ express 2008, and using the 2.04 tesseract.sln. I followed my own instructions on the readme wiki, and it worked without problem. You can't use VC++ 2005 because MS changed the file format.

Re: problem in running tesseract.exe

2009-06-11 Thread Ray Smith
Nice to hear that works. The 2.04 sln builds this way by default. When 2.04 is fully released I will add a tarball containing an exe that is built this way, so a lot of users will be able to just download it and run... Ray. On Tue, Jun 9, 2009 at 8:12 PM, Hasnat mhas...@gmail.com wrote: Dear

Re: Usage of Tesseract OCR in Windows CE

2009-06-11 Thread Ray Smith
Try the current svn code, and add the preprocessor definition GRAPHICS_DISABLED. Another that you can try is EMBEDDED, but you might not need it for windows mobile. Ray. On Mon, Jun 8, 2009 at 2:13 AM, Raj mail2sun@gmail.com wrote: Hi, I'm also trying 2 integrate Tesseract OCR on windows

Re: tesseract.exe stopped working

2009-06-11 Thread Ray Smith
Try the 2.04 pre-release on the svn site: http://code.google.com/p/tesseract-ocr/source/checkout You need VC++express 2008. Ray. On Sat, Jun 6, 2009 at 10:27 AM, alva alvashe...@gmail.com wrote: Oh and one more thing, the tesseract.txt

Re: tesseract 2.03 configure with cygwin fails to recognize libtiff

2009-06-16 Thread Ray Smith
The problem is that some people complain that there is too little documentation, while others don't read what little there is. I have removed the windows libtiff section from the README wiki, as there was no help for linux there, and put a pointer to the FAQ

Re: tif in garbage out

2009-06-16 Thread Ray Smith
Issues with tiff file reading were fixed for 2.04, now in svn for early testers. For screenshots, read the wiki FAQ. Ray. On Tue, Jun 16, 2009 at 12:35 PM, Salahuddin Pasha salahuddi...@gmail.com wrote: I had same problem in MacOS 10.5.x First, install the libtiff from

Re: Complex OCR problem , any tips ?

2009-06-16 Thread Ray Smith
That is a hard problem. I don't think Tesseract will be of much use for handwriting, especially Doctors' handwriting. Ray. On Mon, Jun 15, 2009 at 7:11 PM, umanga umanga@gmail.com wrote: Greetings all, In the project I am working with I have a scanned PDF document. This document has

Re: re-training quality

2009-06-17 Thread Ray Smith
Running the same data through the training system multiple times does not change accuracy in tesseract. It does not use a back-propagation training process at this time. Ray. On Fri, Jun 12, 2009 at 5:39 AM, Yury Tarasievich yury.tarasiev...@gmail.com wrote: Is the quality of recognition

Re: Difference between boxes

2009-06-17 Thread Ray Smith
Looks like 2 different versions to me, but even if they weren't, you do get different results with different compilers/architectures, partly due to the use of random numbers in one of the algorithms, and possibly due to different floating point treatment and/or different qsort functions. Ray. On

Re: revision 252 not compliling

2009-06-17 Thread Ray Smith
Have you tried runautoconf first? It seems that if the installed version of the automake tools is vastly different to the one I used to make configure, then it doesn't work. Ray. On Wed, Jun 17, 2009 at 2:25 PM, timmckenna mckenna@gmail.com wrote: Granted I probably have no business in the

Re: image depth

2009-06-29 Thread Ray Smith
In 3.00 you can use leptonica to read the image and then pass the Pix directly to tesseract. Ray. On Mon, Jun 29, 2009 at 8:44 AM, Yury Tarasievich yury.tarasiev...@gmail.com wrote: A P wrote: Yury, Did you use the -depth 8 flag or some other option? Well, I used what seemed to be the

Re: Confidence value for each character

2009-06-29 Thread Ray Smith
You can use TessBaseAPI::TesseractExtractResult, but you will have to hack the code a bit to do it, as it is a protected member. If we can correct the way ocropus uses tesseract, we can make this a useful single public member that anyone can use. Ray. On Sun, Jun 28, 2009 at 2:45 PM, hvthaibk

Re: Problems getting started with libtesseract

2009-06-29 Thread Ray Smith
The problem is that tessdll uses its own api instead of the baseapi. You have 2 possibilities: 1. Rewrite your code to use the dll directly (see tessdll.h) 2. Mark TessBaseAPI as dll export by putting a TESSDLL_API in the class definition and putting the appropriate magic incantations to make

Re: Confidence value for each character

2009-07-03 Thread Ray Smith
character regardless of their neighborings? Moreover, the confidence values usually above 100. Is there anything wrong here as tesseract produces confidence values in the range 0-100 only? Thai On Jun 30, 11:27 am, Yury Tarasievich yury.tarasiev...@gmail.com wrote: Ray Smith wrote: You

Re: Illegal malloc request size! (windows 2.01 exe)

2009-07-03 Thread Ray Smith
some hand image processing with gimp or something else. Ray. 2009/7/2 robi robertmilow...@gmail.com HI, I have the same problem for both 2.04 versions (Linux and Windows) Ray Smith wrote: The windows 2.04 executable will be available soon, after I get through the comments

Re: Icons

2009-07-03 Thread Ray Smith
See the training wiki. Ray. On Tue, Jun 30, 2009 at 3:42 PM, taelmx tae...@gmail.com wrote: Hey guys, could this possibly be used to identify icons on a rather large resolution CAD drawing? (Rasterized) It's a symbol that looks like a [T] with diagonal lines in the square. By chance would I

Re:

2009-07-06 Thread Ray Smith
Sorry about that misleading comment. I have improved the FAQ. The fix in 2.04 is that it works properly with libtiff, NOT that it reads more tiff files without it. Leptonica itself likes to have (doesn't absolutely need) additional imaging libraries (tiff, jpg, png, gif) and then can read all

Re: Why limit training tesseract to 32 fonts?

2009-07-07 Thread Ray Smith
The 32 font limit (MAX_NUM_CONFIGS) was a hardware limit. (Long story) The code that reads the inttemp file in 2.04 and below is specific to the value of MAX_NUM_CONFIGS so you can increase it as long as you retrain yourself. With 3.00, the data file reader is able to read files with a different

Re: Tesseract 3.0

2009-07-09 Thread Ray Smith
*This is a plea for help!* Anyone interested in seeing 3.00 this side of August? Here is the status: Linux: Preliminary alpha release compiles and runs. It is slower than 2.04, due to the new page layout analysis, but the benefits are supposed to outweigh that: Page layout analysis. *Lots* of

Re: Building new language with tesseract, characters touching

2009-07-09 Thread Ray Smith
Done. All the wikis will need a major update for 3.00 when it comes anyway. Ray. On Mon, Jun 1, 2009 at 3:51 PM, Matt Chan talc...@gmail.com wrote: I think I got around it. I wasn't copying over the word-dawg and freq-dawg files from another language or generating them. I just touched empty

Re: Tesseract 3.0

2009-07-09 Thread Ray Smith
-09 at 19:10 -0700, Ray Smith wrote: This is a plea for help! Anyone interested in seeing 3.00 this side of August? Here is the status: Linux: Preliminary alpha release compiles and runs. It is slower than 2.04, due to the new page layout analysis, but the benefits

Re: Tesseract 2.4 API Reference or Documentation?

2009-08-05 Thread Ray Smith
There was no documentation for 2.04 because the api was to change for 3.00. That change has now happened. There still isn't much documentation, but api/baseapi.h is fairly well commented, and intended to be largely self-documenting. Most people prefer examples to api documentation, and they are

Re: What is the use of DLLSYM

2009-08-05 Thread Ray Smith
For windows. Some of the code was part of a windows app that was broken into dlls. Ray On Jul 21, 2009 7:41 AM, Sandeep sandeep.a...@gmail.com wrote: Why is DLLSYM defined in the platform.h and then used in front of class and function declarations ?

Re: delimiting paragraphs

2009-08-05 Thread Ray Smith
If there are blank lines between paragraphs, the new page layout will do this for you in 3.00. If not, it will probably do this in the future. If you want to have a crack at it yourself, you would have to modify the page layout analysis or add it as a postprocess based on the word boxes. Ray. On

Re: Boxing with Multipage tiff

2009-08-05 Thread Ray Smith
The box file needs an extra field on each line giving the page number. Can't remember whether 0 based or 1. I think 0 for the first page. Mftraining and cntraining need no modification. The tr file is just a stream of feature sets, so they don't care. Ray. On Jul 19, 2009 7:45 AM, 74yrs old
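The extra page field described above can be handled with a small parser along these lines. This is a sketch under assumptions: it takes the usual box-file field order of symbol, left, bottom, right, top, treats an optional sixth field as the page number, and defaults to page 0 when the field is absent (the post itself is unsure whether numbering starts at 0 or 1). The names `Box` and `parse_box_line` are invented for the example.

```python
from typing import NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line):
    """Parse one box-file line, with or without a trailing page number."""
    fields = line.split()
    if len(fields) not in (5, 6):
        raise ValueError(f"expected 5 or 6 fields, got {len(fields)}: {line!r}")
    symbol = fields[0]
    left, bottom, right, top = (int(value) for value in fields[1:5])
    # Assumption: a missing page field means the first page, numbered 0.
    page = int(fields[5]) if len(fields) == 6 else 0
    return Box(symbol, left, bottom, right, top, page)
```

For example, `parse_box_line("T 10 20 30 40 1")` yields a box for the symbol `T` on page 1, while the same line without the last field is assigned to page 0.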

Re: Windows Installer

2009-08-05 Thread Ray Smith
Hi and thanks for volunteering. Although tesseract is a command-line program, there are still a lot of users that trip over at the first hurdle of having to unpack the tar.gz of the binary and add the language files in the right place. With 3.00, things are simpler in that I have the vcproj file

Re: DLL Change with 2.04?

2009-08-10 Thread Ray Smith
They were unchanged. 2.04 was mostly a bug and portability release. 3.00 on the other hand is completely different. Ray. On Sun, Aug 9, 2009 at 4:57 PM, Daryl c...@daryllafferty.com wrote: I am using tessdll from version 2.03 in a C++ Windows program. I see there is a 2.04 version of

Re: New Line or Paragraph problem

2009-08-10 Thread Ray Smith
Sounds like loss of the last word of the first line, or a soft-hyphen problem to me. Ray. On Sun, Aug 9, 2009 at 4:49 PM, Daryl c...@daryllafferty.com wrote: I am using tessdll in a C++ program. Sometimes, seemingly randomly, Tesseract will join sequential lines together without even a space

Re: Problem with boxfile

2009-08-10 Thread Ray Smith
Part of word rejected due to word being too long. That is one of the reasons why the training wiki says to make your training data look like real words. Ray. On Mon, Jul 27, 2009 at 10:51 PM, Hans Peter Bremer hapebre...@googlemail.com wrote: Hi, i've got a problem with the creating of a

Re: Single character recognition

2009-08-10 Thread Ray Smith
Using 3.00, use api.SetPageSegMode(PSM_SINGLE_CHAR); after api.Init(). See api/tesseractmain.cpp. Ray. On Mon, Jul 13, 2009 at 7:11 AM, hvthaibk hvtha...@gmail.com wrote: Hello, I am trying to use tesseract to recognize an image containing one character only. How could I turn off the

Re: Tesseract 2.4 API Reference or Documentation?

2009-08-10 Thread Ray Smith
.. is that right? Does the box files supports that? I think someone had posted this question also... thanks ;) On Wed, Aug 5, 2009 at 3:43 PM, Ray Smith theraysm...@gmail.com wrote: There was no documentation for 2.04 because the api was to change for 3.00. That change has now happened

Re: tesseract 3.0 missing svn file

2009-08-20 Thread Ray Smith
The problem is that the configure script was out of date. I have just updated the configure script and it should now work, unless your system doesn't have the correct version of autotools, in which case you still have to run runautoconf. tesseractmain.cpp moved to the new api directory. Please let

Re: Different output of tesseract with the SAME input

2009-08-20 Thread Ray Smith
See answer to issue 233. Ray. On Thu, Aug 13, 2009 at 1:32 AM, cmm mod...@fbk.eu wrote: Hi! I'm using tesseract (version 2.04 / Linux) to recognize text extracted from images. My problem is that apparently I have different results by applying twice a tesseract function on the same bitmap. I

Re: Multi thread support

2009-08-20 Thread Ray Smith
Thread-safety is not yet available, but will probably be available in a future release. The 3.00 api moves towards this by making it based on an instance of the api instead of static methods. It is nearly possible to use two different apis alternately in the same thread, but using them

Re: Version 3

2009-08-20 Thread Ray Smith
As soon as I can check through the pile of questions and issues that have appeared while I was away. It is already in svn if you want to give it a try. On the other hand, if you are the Q (from STTNG) can you not just wave your hand and make it happen? ;-) Ray. On Mon, Aug 10, 2009 at 11:51 PM, Q

Re: interpret UNLV output.

2009-08-20 Thread Ray Smith
See http://www.isri.unlv.edu/ISRI/OCRtk The ^ before a character indicates that it is suspicious in some sense to tesseract, and ~ indicates a reject. The output is in latin 1 instead of utf-8, and may not work at all for non-latin text. Ray. On Mon, Aug 17, 2009 at 2:52 PM, jia
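The marker convention described above could be read back with a sketch like the following. It assumes the output is Latin-1 bytes, that `^` flags the single character that follows it, and that `~` stands in place of a rejected character; any escaping rules the real format may have are not handled, and the function name `parse_unlv` is made up for the illustration.

```python
def parse_unlv(raw):
    """Split UNLV-style output into (character, status) pairs.

    `raw` is the output as bytes; it is decoded as Latin-1.
    Status is "ok", "suspicious" (character was preceded by ^),
    or "reject" (the ~ placeholder).
    """
    text = raw.decode("latin-1")
    result = []
    suspicious_next = False
    for ch in text:
        if ch == "^" and not suspicious_next:
            # Marker only: remember it and flag the next character.
            suspicious_next = True
            continue
        if ch == "~":
            status = "reject"
        elif suspicious_next:
            status = "suspicious"
        else:
            status = "ok"
        result.append((ch, status))
        suspicious_next = False
    return result
```

Given the bytes `c^at ~`, this reports `a` as suspicious, the final `~` as a reject, and the remaining characters as ok.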

Re: tesseract sometimes read 8 as 0 and sometimes doesn't, why?

2009-08-20 Thread Ray Smith
Works OK for me with 3.0 (apart from the problem that it puts all the units in a separate column) It might be an adaption error. It is hard to say without being able to reproduce the problem and run it with the viewer. Ray. On Wed, Aug 12, 2009 at 2:11 AM, Alcareru sipulima...@yahoo.co.uk wrote:

Re: Encoding format with TesseractExtractResult?

2009-09-17 Thread Ray Smith
TesseractExtractResult was written by OCRopus, and they only care about single lines, so it has no way of telling the end of line. The text string is already utf-8. It needs no further conversion. If you want access to the end of line flag, the easiest way is to subclass TessBaseAPI and write a

Re: Video SEX online?

2009-10-03 Thread Ray Smith
I keep banning spammers. The number banned is now up to 22. Ray. On Sep 30, 2009 5:43 PM, Chen TsoLin tsolin.c...@gmail.com wrote: Dears: what happens with this mail..@@!!! Administrator, please remove this kind of mail from mail groups~~thanks 2009/10/1 Inga M.

Re: Checking out latest version

2009-10-10 Thread Ray Smith
From the checkout page: Non-members may check out a read-only working copy anonymously over HTTP. svn checkout http://tesseract-ocr.googlecode.com/svn/trunk/ tesseract-ocr-read-only On Fri, Oct 9, 2009 at 4:00 PM, John a164666...@gmail.com wrote: J. Garcia wrote: Hi folks, I tried to
