[tesseract-ocr] Uneven Text w/ Some Words and a Single Char, Single Char isn't being Picked Up

2019-12-17 Thread Ryan Mee
I have an image that looks like: [image: zero-bot-nospace.png] I've tried running it through Tesseract with each of the PSM options, but none of them seem to be able to get the '0'. The 1.15 +0.05 is getting recognized, but that's it. Any tips for getting everything?
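One way to try every page segmentation mode systematically is to generate the command line for each mode and compare the outputs side by side. The sketch below only builds the commands. It assumes the Tesseract 4.x spelling of the flag (`--psm`; 3.x builds used `-psm`), modes 0 through 13, and the `eng` language; nothing here guarantees that any mode will recover the isolated '0'.

```python
# Sketch: build one tesseract command per page segmentation mode (PSM).
# Assumptions (not from the thread): Tesseract 4.x CLI, modes 0-13, lang "eng".

def psm_commands(image_path, lang="eng", modes=range(0, 14)):
    """Return a list of argv lists, one per PSM value."""
    commands = []
    for mode in modes:
        # Using "stdout" as the output base makes tesseract print the text.
        commands.append(
            ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(mode)]
        )
    return commands


if __name__ == "__main__":
    for argv in psm_commands("zero-bot-nospace.png"):
        print(" ".join(argv))
```

Each argument list can be handed to `subprocess.run(argv, capture_output=True, text=True)` to collect the text per mode. Mode 0 only reports orientation and script, so it can be dropped from the range.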

Re: [tesseract-ocr] Re: Training error "Couldn't find a matching blob"

2019-03-19 Thread ryan
Wondering if this issue was fixed in Tesseract 3.05.02. Any ideas? On Friday, August 10, 2018 at 7:51:59 AM UTC-4, Mehul Bhardwaj wrote: Hi, I went through this discussion thread and updated to Tesseract 3.05.02. Previously I was working with version 3.05. I was getting the same error

[tesseract-ocr] How to fine tune config files with tesseract to not identify colors/shapes and only pick up text?

2019-01-06 Thread Ryan Stevens
Goal: pulling only characters from the image, so that tesseract does not identify the bars in the graph. I specifically need the % and the 5 categories of numbers (>240, 131-240, 80-130, 70-79, <70). How would I fine-tune tesseract to reach that goal? Using default tesseract and tesseract with

[tesseract-ocr] Re: use jTesseractEdit training but box edit is empty

2017-06-07 Thread Shaw Ryan
an empty box file. You'll need to perform some image processing first to make the image more amenable to Tesseract. On Tuesday, June 6, 2017 at 9:44:58 PM UTC-5, Shaw Ryan wrote: Thank you. I have uploaded box and tiff. Please help

[tesseract-ocr] Re: use jTesseractEdit training but box edit is empty

2017-06-06 Thread Shaw Ryan
On Tuesday, June 6, 2017 at 10:49:47 PM UTC+8, Quan Nguyen wrote: You may want to attach your TIFF/Box pair here so people can look and help. On Monday, June 5, 2017 at 8:58:19 PM UTC-5, Shaw Ryan wrote: I have created a box file. On Monday, June 5, 2017 (UTC+8), PM 11:24:

[tesseract-ocr] Re: use jTesseractEdit training but box edit is empty

2017-06-06 Thread Shaw Ryan
Thank you. I have uploaded box and tiff. Please help. On Monday, June 5, 2017 at 6:27:14 PM UTC+8, Shaw Ryan wrote: <https://lh3.googleusercontent.com/-yCj0vqGtWmo/WTUawRK7xkI/ACk/EiGACLCD-G0TQwn7pJc5On1-fYZjLPIfwCLcB/s1600/20170605164727.jpg> How can I edit the data?

[tesseract-ocr] Re: use jTesseractEdit training but box edit is empty

2017-06-05 Thread Wang Ryan
I have created a box file. On Monday, June 5, 2017 at 11:24:36 PM UTC+8, Quan Nguyen wrote: You'd need to provide the box file also. If you do not have one, you can create the box file using the options provided in the other tabs. On Monday, June 5, 2017 at 5:27:14 AM UTC-5,

[tesseract-ocr] use jTesseractEdit training but box edit is empty

2017-06-05 Thread Wang Ryan
How can I edit the data?

Re: [tesseract-ocr] how to use tesstrain .sh etc in ubuntu 15.10

2015-11-10 Thread Ryan Baumann
Thanks for this, Nick. I'm just getting around to looking into moving my Latin training into the tesstrain.sh system and this is very helpful. -Ryan On Monday, November 9, 2015 at 11:07:42 AM UTC-5, Nick White wrote: OK, I just made a first attempt at some documentation for tes

[tesseract-ocr] Re: Output cropped words

2015-09-28 Thread Ryan Baumann
, it should work for splitting an image into words. Best, -Ryan On Sunday, September 27, 2015 at 12:38:58 PM UTC-4, Nathan Cain wrote: I have a project similar to reCAPTCHA where I need humans to type words instead of computer OCR. Is there a way for tesseract to split an image

[tesseract-ocr] How do I improve my accuracy with a small set of numbers?

2015-05-23 Thread Ryan
I have to read sets of numbers from a very large number of cards, but I need good accuracy. 15 digits and a 4 digit pin. There's no check digit, but some digits are the same on every card and the font and spacing are the same. I've attached a sample image below. I tested tesseract on that
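Because the card layout is fixed, recognized text can be sanity-checked after OCR even without a check digit. The sketch below is a hypothetical post-processing step, not something proposed in the thread: it assumes each card yields a 15-digit number and a 4-digit PIN, and that the positions and values of the constant digits are known in advance (`FIXED_DIGITS` is made up for illustration).

```python
import re

# Hypothetical: 0-based positions in the 15-digit number that hold
# the same digit on every card. Replace with the real constants.
FIXED_DIGITS = {0: "6", 1: "0", 2: "3"}

NUMBER_RE = re.compile(r"[0-9]{15}")
PIN_RE = re.compile(r"[0-9]{4}")


def is_plausible(number, pin, fixed=FIXED_DIGITS):
    """Return True if an OCR result matches the known card layout."""
    # OCR often inserts spaces between digit groups; ignore them.
    number = re.sub(r"\s+", "", number)
    pin = re.sub(r"\s+", "", pin)
    if not NUMBER_RE.fullmatch(number) or not PIN_RE.fullmatch(pin):
        return False
    # Every constant position must carry its expected digit.
    return all(number[pos] == digit for pos, digit in fixed.items())
```

Reads that fail the check can be queued for a second pass or manual review. Restricting recognition to digits (for example with the `tessedit_char_whitelist` variable) is a separate measure that may help as well.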

[tesseract-ocr] Re: Tesseract With Opencl

2015-05-13 Thread Ryan Baumann
I wrote up my experiments with OpenCL-enabled Tesseract here: http://ryanfb.github.io/etc/2015/03/18/experimenting_with_opencl_for_tesseract.html On Friday, May 8, 2015 at 3:58:42 PM UTC-4, Mohammad Umar wrote: Hi, Any body gained speed-up using tesseract with OpenCL enabled? If any speed

[tesseract-ocr] Re: text2image crash

2015-04-01 Thread Ryan Baumann
://github.com/ryanfb/tesseract_latinocr_docker - Training from fonts works surprisingly well, but if there are significant artifacts introduced by your pipeline/capture process, you may get better accuracy with a manual box/train against images. -Ryan On Tuesday, March 31, 2015 at 3:43:23

[tesseract-ocr] Re: text2image crash

2015-04-01 Thread Ryan Baumann
This appears to be an issue with --find_fonts and/or --strip_unrenderable_words. The following command succeeds for me: $ text2image --exposure=0 --font "Helvetica Neue Thin" --outputbase=eng.Helvetica_Neue_Thin.exp0 --text=/Users/ryan/source/tesseract/tesseract-ocr.langdata/eng
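The font name in the command above contains spaces, so it has to reach text2image as a single argument. A minimal sketch of assembling the same flags as an argument list, which sidesteps shell quoting when the command is launched from a script; the training-text path used in the demo is a hypothetical placeholder, since the original path is cut off.

```python
import shlex


def text2image_argv(font, outputbase, text_path, exposure=0):
    """Build a text2image argument list using only the flags from the post."""
    return [
        "text2image",
        f"--exposure={exposure}",
        "--font", font,  # stays one argument even with spaces in the name
        f"--outputbase={outputbase}",
        f"--text={text_path}",
    ]


if __name__ == "__main__":
    argv = text2image_argv(
        "Helvetica Neue Thin",
        "eng.Helvetica_Neue_Thin.exp0",
        "training_text.txt",  # hypothetical placeholder path
    )
    # shlex.join shows how the list would have to be quoted in a shell.
    print(shlex.join(argv))
```

Passing the list to `subprocess.run(argv)` without `shell=True` keeps the font name intact.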

[tesseract-ocr] Re: Latin language

2014-12-16 Thread Ryan Baumann
. -Ryan On Monday, November 24, 2014 11:16:01 AM UTC-5, Ryan Baumann wrote: Pull requests or patches are more than welcome, as I'm just getting familiar with the Tesseract training process myself. I've just pushed a few changes to get possibly-better output for the training_text and word

[tesseract-ocr] Re: Latin language

2014-11-24 Thread Ryan Baumann
be something where someone with more domain-specific knowledge of both Latin and Tesseract will be able to do a better job than me. Due to the upcoming US holidays, I probably won't be able to do much more work on it this week. Best, -Ryan On Saturday, November 22, 2014 4:15:12 AM UTC-5, Guido

[tesseract-ocr] Re: Latin language

2014-11-21 Thread Ryan Baumann
was slow, so I switched to using Perl so I could apply a non-greedy regex substitution instead (which is much faster): https://github.com/ryanfb/ancientgreekocr-grctraining/commit/069648af2e2b45e41fd7e4ff4390343b45765f77 -Ryan

[tesseract-ocr] Training data gets worse as I add characters

2014-11-21 Thread Ryan Dev
I am trying to cover as much as I can of the Latin Unicode characters in the BMP. What I find is that as I add more characters, the OCR results get worse. For example, instead of getting the correct ö I get Ö and then as I added more characters the latest result is Ṏ. In other words, not only

Re: {Spam?} Re: [tesseract-ocr] text2image infinite ScrollView: Waiting for server

2014-11-19 Thread Ryan Baumann
/ryanfb/homebrew/blob/tesseract_training/Library/Formula/tesseract.rb -Ryan On Monday, July 21, 2014 11:36:07 AM UTC-4, Bob Aman wrote: I'm getting something pretty similar, and I've now compiled/recompiled in about 20 different ways with 20 different variations of the assorted dependencies

Re: [tesseract-ocr] Covering ASCII Extended range.

2014-11-19 Thread Ryan Dev
I'm dealing with font subsets, and I generate an image per font, so there is no reading order. Though I've seen Latin and CJK in the same font subset. If OSD just gives reading, orientation, and text order, it is not going to give me anything useful. Plus I have the font, so I could get some

Re: [tesseract-ocr] Covering ASCII Extended range.

2014-11-18 Thread Ryan Dev
Thanks again. you may get better results using appropriate language data rather than just the ascii range. Are the client documents sorted by language? I'm not sure how they have them organised, I just know they want an automatic solution... I am attaching files used - i had just copied

Re: [tesseract-ocr] Configure for single character recognition

2014-11-14 Thread Ryan Dev
It looks like all your characters are uppercase, but if that is not always the case, my experience with doing per-character OCR in tesseract is that it cannot handle capitalization properly. That is, is it a 'c' or a 'C'? I lay out all my characters in a straight line, and get much better results

Re: [tesseract-ocr] Covering ASCII Extended range.

2014-11-14 Thread Ryan Dev
asc traineddata does not have a wordlist or dictionary, so using eng will help with that. You mean unpack the wordlist from eng and pack it into the asc one? Or run tesseract with eng+asc? Currently I run each language in complete isolation from each other, and figure out the results

Re: [tesseract-ocr] Covering ASCII Extended range.

2014-11-13 Thread Ryan Dev
Wow! Awesome. That file definitely helps. It fixed a few issues, but introduced a few of its own, so currently I am running eng+asc and that is giving great output, and is running faster than eng+deu. Attached is an example image and output using asc. Note that asc is getting the 'ü' as a

[tesseract-ocr] Covering ASCII Extended range.

2014-11-12 Thread Ryan Dev
For the project I am working on, I need to do OCR on documents with characters that are covered by the ISO 8859-1 Extended ASCII range (0x20-0xFF): http://www.ascii-code.com/ I was wondering, does anyone have traineddata files for this? Or do they know which existing language traineddata files would

[tesseract-ocr] Re: 6od instead of God

2014-11-10 Thread Ryan Dev
What PSM mode are you in? I see the H chopped into |-| when using PSM_SINGLE_LINE especially, and I don't think ever with PSM_AUTO. For my project I was running into the same issue, but I know my glyphs are not ever touching or overlapping, so I simply disabled chopping altogether. But for
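For reference, a sketch of how chopping could be switched off from the command line. It assumes the relevant parameter is `chop_enable` (the post does not name it), that the legacy character recognizer is in use, and that the build accepts `-c name=value`; the image name is hypothetical. Builds from the 3.x era spelled the mode flag `-psm`, so it is left configurable.

```python
# Page segmentation mode numbers matching the enum names used in the post.
PSM_AUTO = 3
PSM_SINGLE_LINE = 7


def tesseract_argv(image_path, psm=PSM_SINGLE_LINE, variables=None,
                   psm_flag="--psm"):
    """Build a tesseract command with optional `-c name=value` overrides."""
    argv = ["tesseract", image_path, "stdout", psm_flag, str(psm)]
    # Sort so the generated command is deterministic.
    for name, value in sorted((variables or {}).items()):
        argv += ["-c", f"{name}={value}"]
    return argv


if __name__ == "__main__":
    # Assumption: chop_enable=0 disables the character chopper.
    print(" ".join(tesseract_argv("glyphs.png", variables={"chop_enable": 0})))
```

Disabling chopping is only reasonable when, as in the post, the glyphs are known never to touch or overlap.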

Re: [tesseract-ocr] Passing glyph vector data directly to tesseract

2014-10-31 Thread Ryan Dev
Here is an example of glyphs from one font. The upper case i is ocr'd as lower case L, and the lower case L was ocr'd as vertical bar '|' https://lh6.googleusercontent.com/-q3kSpzpaOfg/VFO9HAwqqUI/AAk/y70T5yE_x7g/s1600/FPDGJB%2BDKFrutiger-Bold80HL.tiff In an earlier post [1] it was

[tesseract-ocr] Passing glyph vector data directly to tesseract

2014-10-24 Thread Ryan Dev
Hi, I have what I think is a unique situation, and I was hoping I could get some hints on how to proceed. I have problem font files whose Unicode mappings I want to fix. I also have PDF files with these fonts, so I also have contextual semantics available. Currently I draw all

[tesseract-ocr] Searchable PDF output with oversized font

2014-09-17 Thread Ryan Johnson
Hi all, I'm having problems with tesseract-ocr since upgrading to Ubuntu 14.04 LTS. When I use either hocr or the internal tesseract output for searchable pdfs I get an oversized font that fills the page too quickly and does not follow the text in the image. I scan the images as tiffs at 300

[tesseract-ocr] Re: need help removing garbage characters from my OCR

2014-07-14 Thread Alex Ryan
filter out any color other than brown and white or is your algorithm more sophisticated? If it is, it would be great if you could share the basic idea. Paul On Saturday, July 12, 2014 at 00:06:29 UTC+2, Alex Ryan wrote: Just wanted to follow up. I wrote some simple code to preprocess the image

[tesseract-ocr] Re: need help removing garbage characters from my OCR

2014-07-11 Thread Alex Ryan
Just wanted to follow up. I wrote some simple code to preprocess the image because I realized I will be doing basically the same image every time, so it's foolish to try and use Tesseract's binarization technique, which was designed for a different and more general purpose. So basically I just

Re: [tesseract-ocr] need help removing garbage characters from my OCR

2014-07-10 Thread Alex Ryan
Paul, I haven't gotten a chance to play around with that yet, but thanks for linking that; I might very well have to go that route. I am having a very confusing issue, though, that I'm hoping someone can shed some light on. I've been testing out my language traineddata on a bunch of different

Re: [tesseract-ocr] need help removing garbage characters from my OCR

2014-07-10 Thread Alex Ryan
any in the directory, so that can't be it, although that would have made sense. Thanks for all the help! On Thursday, July 10, 2014 11:18:50 AM UTC-7, Nick White wrote: On Tue, Jul 08, 2014 at 10:36:50PM -0700, Alex Ryan wrote: In one of the links, though, I saw something about the -psm setting. When I

Re: [tesseract-ocr] need help removing garbage characters from my OCR

2014-07-09 Thread Alex Ryan
Thank you SO much for the replies, guys! I read up on those binarization links, and that looks like it's going to be a bit out of my wheelhouse to implement. I see that there is a Python/OpenCV implementation of that paper, but I'm not sure if I could get that going, as I'm not familiar with

[tesseract-ocr] need help removing garbage characters from my OCR

2014-07-08 Thread Alex Ryan
I'm trying to make a Words with Friends cheat for a university project. I'm obviously trying to OCR the tiles from a screenshot of the app. I have tesseract 3.03 set up and running fine, but I'm not getting usable output. I've tried various training methods but so far haven't hit upon the

Porting of current project to Android

2013-11-19 Thread Ryan Strange
Hi, I have a current C++ program which uses tesseract to identify numbers, which is working. I want to port this project to Android using the NDK and JNI, which I have done with other projects without tesseract that work perfectly. My problem is that now that I have added the tesseract OCR, I have

Re: List of all variables settable by TessBaseAPI::SetVariable()

2012-11-20 Thread Ryan
There is also the TessBaseAPI::PrintVariables method, which includes all the ones listed below plus a bunch of internal variables. FILE* fp = fopen(out_path.c_str(), "w"); if (fp != NULL) { tess_api->PrintVariables(fp); fclose(fp); } On Monday, June 27, 2011 2:22:45 AM UTC-7, 8flm6 wrote:
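Once the variables have been dumped to a file, the list can be searched offline. A small Python sketch for loading such a dump, assuming each line holds a name, a value and a description separated by tabs; that layout is an assumption to verify against the output of your own build, and the sample lines in the demo are illustrative only.

```python
def parse_variables(text):
    """Parse a variable dump into {name: (value, description)}.

    Assumes one variable per line in the form name<TAB>value<TAB>description.
    """
    params = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        parts = line.split("\t", 2)
        name = parts[0]
        value = parts[1] if len(parts) > 1 else ""
        info = parts[2] if len(parts) > 2 else ""
        params[name] = (value, info)
    return params


if __name__ == "__main__":
    # Illustrative sample lines, not copied from a real dump.
    sample = "chop_enable\t1\tChop enable\ntessedit_char_whitelist\t\tWhitelist of chars to recognize\n"
    for name, (value, info) in parse_variables(sample).items():
        print(f"{name} = {value!r}  # {info}")
```

Filtering the resulting dictionary by substring (for example, every name containing `chop`) is a quick way to find candidate settings for SetVariable().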

Re: Ligature detection

2012-10-25 Thread Ryan
Thanks again! GetBoxText() was the ticket.

Accuracy problems : alpha and numeric characters getting switched around

2012-10-23 Thread Ryan
Hi, I am using tesseract to generate unicode mappings for 'corrupt' font files. While I have complete control over rendering of the characters (size, positioning, colors) I am having troubles with accuracy. Mainly tesseract seems to like numbers over letters. In particular, lower case 'l's