Re: [tesseract-ocr] Expected output of LSTMTRAINING

2019-01-07 Thread Timothy Snyder
Great! Thanks, Shree. I totally missed that section. On Mon, Jan 7, 2019 at 11:08 AM Shree Devi Kumar wrote: > You need to convert the checkpoint to a traineddata file. > > Please see > https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#combining-the-output-files > > On Mon,

Re: [tesseract-ocr] Re: Tesstrain.sh fails when provided > 7 tif/box pairs

2019-01-07 Thread Timothy Snyder
Unfortunately this did not work for me. I still have to change these lines in *tesstrain.sh* to successfully run it. phase_I_generate_image ** ... phase_E_extract_features " --psm 6 lstm.train " ** "lstmf" ... phase_E_extract_features "box.train" ** "tr" For mine to work, ** can

Re: [tesseract-ocr] Box file layout for training tesseract4

2019-01-25 Thread Timothy Snyder
I have successfully trained Tesseract 4.0 using boxes that cover an entire line. I was similarly confused by the mismatch between the docs and that example. I haven't tested training with character-bounding boxes but I can confirm that textline boxes work fine. On Fri, Jan 25, 2019 at 5:56 AM Jul

Re: [tesseract-ocr] Should i use lstm training or TIFF/BOX file training?

2019-01-31 Thread Timothy Snyder
When you refer to TIFF/BOX file training, do you mean manually creating your own boxfiles from your own set of images? Note that by default, lstmtraining does generate TIFF/BOX files from the fonts that you specify it to train on. With a little bit of wrangling, you can actually configure lstmtrai

Re: [tesseract-ocr] Ocr-d train - Tesseract 4.0 Training

2019-02-06 Thread Timothy Snyder
I'm pretty sure you have to have a font for LSTM training. When I trained Tesseract 4 for handwriting, I used a font that was based on handwriting to fulfill Tesseract's requirement for at least one font. On Wed, Feb 6, 2019, 11:10 PM Thanks for your response, Since these are handwritten digits

Re: [tesseract-ocr] Tesseract performs poorly. What am i doing wrong?

2019-02-08 Thread Timothy Snyder
You may want to try splitting this image into smaller segments and removing elements of the table grid to see if you achieve better results. On Fri, Feb 8, 2019 at 9:45 AM narayanan iyer wrote: > I have scaled the image and also did binarization. Still i get bad > results, Is there anythi

Re: [tesseract-ocr] Box file layout for training tesseract4

2019-03-01 Thread Timothy Snyder
Sorry for the delay. You have access now. I need to set the link to public! On Mon, Feb 25, 2019 at 8:10 AM mohito wrote: > Hi, > > would you be so kind to make this link public or give me permissions to > see your examples? > To see an example would help so much. > > Best Regards > > Am Mittwoc

Re: [tesseract-ocr] Failed loading language 'eng' - Windows Server 2016

2019-05-14 Thread Timothy Snyder
Do you have an "eng.traineddata" file in the directory that you specified with the --tessdata-dir flag? On Tue, May 14, 2019 at 9:13 AM Pedro Lima wrote: > Environment: > >- I am getting this error in one specific server (Windows Server 2016 >x64) when I try to use tesseract. Failed load
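
The check asked about above can be sketched in Python. This is only an illustration: the helper name `has_traineddata` is hypothetical, and it just tests for the file on disk without calling Tesseract.

```python
from pathlib import Path


def has_traineddata(tessdata_dir: str, lang: str = "eng") -> bool:
    """Return True if <lang>.traineddata exists directly inside tessdata_dir.

    tessdata_dir should be the same directory that is passed to Tesseract
    with the --tessdata-dir flag.
    """
    return (Path(tessdata_dir) / f"{lang}.traineddata").is_file()
```

If this returns False for the directory given to --tessdata-dir, a "Failed loading language" error is the expected outcome.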

[tesseract-ocr] What does --noextract_font_properties do?

2019-05-15 Thread Timothy Snyder
Hey all, quick question: What does --noextract_font_properties do when using tesstrain.sh? I've been using the flag for training since it's used in the training guide on GitHub. However, I can't seem to find any usage information. tesstrain.sh doesn't seem to include it in its usage info:

Re: [tesseract-ocr] Re: Training with a large number of LSTMF files

2019-05-19 Thread Timothy Snyder
I had moderate-to-good success fine tuning the Tesseract 4 english model with handwriting samples from the IAM handwriting database. On Sat, May 18, 2019 at 2:33 PM Shree Devi Kumar wrote: > No, I have not done handwriting training. Others who have tried can share > if they had success. > > On S

Re: [tesseract-ocr] table ocr with tesseract(tess4j)

2019-06-19 Thread Timothy Snyder
Would you be able to provide an example of said table? On Wed, Jun 19, 2019 at 8:40 AM Momene Vigal wrote: > Hello, please im a beginner with tesseract actually using it with java > please can anyone help me with how to do the ocr of a table with > tesseract > in python or java > > -- > You rec

Re: [tesseract-ocr] Doubt with handwritten texts.

2019-06-21 Thread Timothy Snyder
It's not possible out-of-the-box with Tesseract but I've reached ~90% accuracy so far on a handwriting model I'm working on. Check out projects like IAM, EMNIST, and UNIPEN to start collecting handwriting data/images. You will probably want to segment the handwritten text off the check and apply a

Re: [tesseract-ocr] GPU for Tesseract

2019-06-28 Thread Timothy Snyder
I think it means that Tesseract neither supports nor requires hardware acceleration via the GPU. Looks like there is experimental support for OpenCL in Tesseract, though it doesn't appear to be a very mature feature. On Fri, Jun 28, 2019 at 1:54 AM Pooja Kamra wrote: > On Tesseract site, it is me

Re: [tesseract-ocr] Train Tesseract to ignore music?

2019-06-28 Thread Timothy Snyder
A picture would be helpful. From my experience, however, writing an independent program to segment text from "noisy" images with a lot of non-text print will give you the best results. Depending on how much the layout of those books varies between pages, this could be a simple or complicated task.

[tesseract-ocr] Parameters to increase tolerance of whitespace between characters?

2019-07-10 Thread Timothy Snyder
Hello all, Does anyone know of any config parameters that will increase the tolerance of whitespace between characters, i.e., increase the amount of whitespace needed to trigger word segmentation? I have many cases in my text where there is extra whitespace between characters resulting in the

Re: [tesseract-ocr] Hand Writing detection using tesseract 4

2019-07-29 Thread Timothy Snyder
Tesseract is not exactly meant nor designed for handwriting recognition though it is possible with the right training. I suggest you become familiar with the Tesseract training process for regular fonts and once you're comfortable with those processes, try and train it with handwriting images. A

Re: [tesseract-ocr] Specific localization and doing OCR

2019-07-29 Thread Timothy Snyder
Are those green boxes a static component of the image or are you calculating them at runtime? In short, there is no way to train Tesseract to seek out those green boxes on its own. If you have the coordinates of the rectangles at the time of recognition you can limit Tesseract's recognition to tho

[tesseract-ocr] Does Tesseract take surrounding words into account during recognition?

2019-07-30 Thread Timothy Snyder
Hi all, My question is within the context of performing recognition on a single textline. My understanding is that tesseract will segment a textline into word segments and then perform recognition on each of those word segments. During recognition, does it take into account the transcription of

Re: [tesseract-ocr] Tweak to lower accuracy for faster processing

2019-08-01 Thread Timothy Snyder
If you're training your own models, try including the --convert_to_int flag when converting from a checkpoint to a traineddata. Otherwise if you're using the base language models, try out the "fast" version in the repository. On Thu, Aug 1, 2019 at 3:08 PM Thomas Mann wrote: > Hi all, > > I was
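
A minimal sketch of the conversion step mentioned above, written as a Python helper that assembles the `lstmtraining` command line. The function name and all three paths are hypothetical; the flags (`--stop_training`, `--convert_to_int`, `--continue_from`, `--traineddata`, `--model_output`) are the ones used in the Tesseract 4 training guide.

```python
import shlex


def build_stop_training_cmd(checkpoint, base_traineddata, output,
                            convert_to_int=False):
    """Build the lstmtraining call that turns a checkpoint into a traineddata."""
    cmd = ["lstmtraining", "--stop_training"]
    if convert_to_int:
        # Integer model: faster recognition, possibly slightly less accurate.
        cmd.append("--convert_to_int")
    cmd += [
        "--continue_from", checkpoint,
        "--traineddata", base_traineddata,
        "--model_output", output,
    ]
    return cmd


# Hypothetical paths, shown only to illustrate the resulting command line.
example = build_stop_training_cmd(
    "out/base_checkpoint", "out/eng/eng.traineddata", "out/eng_int.traineddata",
    convert_to_int=True,
)
print(" ".join(shlex.quote(arg) for arg in example))
```

The list form can be passed directly to `subprocess.run` once the paths point at real files.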

Re: [tesseract-ocr] Re: Using my own detection instead of tesseract's

2019-08-08 Thread Timothy Snyder
On my project I detect and crop down to textline level on my own. Then, with PSM 13, I give tesseract a single line of text. On Wed, Aug 7, 2019 at 4:50 AM 'Nima Afshar' via tesseract-ocr < tesseract-ocr@googlegroups.com> wrote: > By detection i mean text detection,by the way your right i should'

Re: [tesseract-ocr] using tesseract to read text on tire

2019-08-26 Thread Timothy Snyder
A lot more work has to be done on preprocessing that image. Consider the qualities of printed text that Tesseract is designed to recognize. My advice is to always try and reduce the image to solid black text on a white background before attempting to pass it to Tesseract. On Sun, Aug 25, 2019 at 1
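
As a rough illustration of "solid black text on a white background", here is a pure-Python sketch of global thresholding on a grayscale image held as a list of rows. The fixed threshold of 128 and the "invert if mostly dark" heuristic are assumptions made for this example; a real pipeline would more likely use an image library and an adaptive threshold.

```python
def to_black_on_white(gray, threshold=128):
    """Binarize a grayscale image (rows of 0-255 ints) to 0 (black) / 255 (white).

    Assumption: text covers less area than background, so if most pixels come
    out black the image is treated as light-on-dark and inverted.
    """
    binary = [[255 if px >= threshold else 0 for px in row] for row in gray]
    total = sum(len(row) for row in binary)
    white = sum(1 for row in binary for px in row if px == 255)
    if total and white * 2 < total:
        binary = [[255 - px for px in row] for row in binary]
    return binary
```

The output contains only the values 0 and 255, with the minority class mapped to black.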

Re: [tesseract-ocr] my scan of alphanumeric data needs TLC

2019-08-27 Thread Timothy Snyder
Try out the single line PSM modes (7 and 13). I've had the best luck with 13 on single line images. Also, see to removing the extra black marks that aren't part of the letters. On Tue, Aug 27, 2019 at 5:12 AM Stephane Charette < stephanechare...@gmail.com> wrote: > I have a large number of images

Re: [tesseract-ocr] Re: Parameters to increase tolerance of whitespace between characters?

2019-08-27 Thread Timothy Snyder
> > Anyone know? > > Stéphane > > > On Wednesday, July 10, 2019 at 8:16:55 AM UTC-7, Timothy Snyder wrote: >> >> Hello all, >> >> Does anyone know of any config parameters that will increase the >> tolerance of whitespace between characters, i.e., in

[tesseract-ocr] net_spec with 2D LSTM?

2019-08-27 Thread Timothy Snyder
Hello all, Does anyone have an example of a net_spec argument that utilizes a 2D LSTM? Thanks, -Tim -- You received this message because you are subscribed to the Google Groups "tesseract-ocr" group. To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-

Re: [tesseract-ocr] How can I train TesseractOCRiOS to recognize handwriting?

2019-08-29 Thread Timothy Snyder
You will have to train it with handwriting samples like IAM handwriting database. On Thu, Aug 29, 2019 at 1:24 PM SlushyPuffin wrote: > Im making an application, the goal is to take a picture of my school notes > and have them processed into just text (so I can have neater notes)... I > have som

Re: [tesseract-ocr] How can I train TesseractOCRiOS to recognize handwriting?

2019-08-29 Thread Timothy Snyder
I would first learn how to train Tesseract with regular fonts. Once you understand that process pretty well, you can think about how you'd go about training Tesseract with samples from something like IAM handwriting database. That process will involve transforming IAM images + metadata files into t

Re: [tesseract-ocr] How can I train TesseractOCRiOS to recognize handwriting?

2019-08-29 Thread Timothy Snyder
Example of what? On Thu, Aug 29, 2019 at 4:19 PM Baking Squad wrote: > Ok thanks! Have you done this before? If so can I have an example? > > Sent from my iPhone > > On Aug 29, 2019, at 4:03 PM, Timothy Snyder wrote: > > I would first learn how to train Tesseract with re

Re: [tesseract-ocr] How can I train TesseractOCRiOS to recognize handwriting?

2019-08-29 Thread Timothy Snyder
r a tutorial on > how I can accomplish it... or how I download what I need to download... > > Sent from my iPhone > > On Aug 29, 2019, at 4:22 PM, Timothy Snyder wrote: > > Example of what? > > On Thu, Aug 29, 2019 at 4:19 PM Baking Squad > wrote: > >> Ok than

Re: [tesseract-ocr] Re: Tesseract.js and traineddata language.

2019-09-03 Thread Timothy Snyder
10 seconds of investigation yielded an FAQ page from the repo explaining how tesseract.js maintains .traineddata files. On Tue, Sep 3, 2019 at 4:21 PM Clint William Theron < theronclintwill...@gmail.com> wrote: > just give me clue! > > On Monday, September 2, 2019 at 11:07:20 PM UTC+2, Clint Wil

Re: [tesseract-ocr] Re: Tesseract.js and traineddata language.

2019-09-03 Thread Timothy Snyder
https://github.com/naptha/tesseract.js/blob/master/docs/faq.md On Tue, Sep 3, 2019 at 4:28 PM Timothy Snyder wrote: > 10 seconds of investigation yielded an FAQ page from the repo explaining > how tesseract.js maintains .traineddata files. > > > On Tue, Sep 3, 2019 at 4:21 P

Re: [tesseract-ocr] How do we pass coordinate to tesseract so that we escape detection process and run only recognition using tesseract

2019-09-06 Thread Timothy Snyder
If you're doing recognition on a single line of text, use --psm 13 or --psm 7. They're both for single-line images, but I've had higher accuracy with 13 than with 7. On Fri, Sep 6, 2019 at 6:18 AM Purushotham Rao Eravalli < purushot...@sukshi.com> wrote: > Will it still do detection for that passed
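
A minimal sketch of the single-line setup described above, assuming the `tesseract` executable is on the PATH and the image is already cropped to one text line. The function names are hypothetical; `--psm` and the `stdout` output base are standard Tesseract command-line usage.

```python
import subprocess


def build_line_ocr_cmd(image_path, psm=13):
    """Build a tesseract call for a single-line image (PSM 7 or PSM 13)."""
    if psm not in (7, 13):
        raise ValueError("expected a single-line page segmentation mode: 7 or 13")
    # "stdout" as the output base makes tesseract print the text instead of
    # writing it to a .txt file.
    return ["tesseract", image_path, "stdout", "--psm", str(psm)]


def ocr_line(image_path, psm=13):
    """Run tesseract on one cropped text line and return the recognized text."""
    result = subprocess.run(
        build_line_ocr_cmd(image_path, psm),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```

Running both modes over the same labelled lines is an easy way to compare 7 and 13 on a given dataset.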

Re: [tesseract-ocr] summarizing LSTM

2019-09-06 Thread Timothy Snyder
Do you want to learn more about neural networks or specifically, a "summarizing LSTM" in a neural network? On Fri, Sep 6, 2019 at 5:05 AM Youcef wrote: > Hi, > > In that page https://github.com/tesseract-ocr/tesseract/wiki/VGSLSpecs from > officiel github repo, it talks about "summarizing LSTM"

Re: [tesseract-ocr] summarizing LSTM

2019-09-06 Thread Timothy Snyder
. On Fri, Sep 6, 2019 at 9:12 AM Purushotham Rao Eravalli < purushot...@sukshi.com> wrote: > It will be great if you provide any source where we can get > detailed information about the architecture used for tesseract and it's > loss functions or so. > > Thanks > >

Re: [tesseract-ocr] summarizing LSTM

2019-09-06 Thread Timothy Snyder
the link for my second sentence ^ https://githubharald.github.io/ On Fri, Sep 6, 2019 at 9:24 AM Timothy Snyder wrote: > This page goes into a little more details than the VGSL spec page in the > Tesseract repo: > https://github.com/mldbai/tensorflow-models/blob/master/street/g3doc/vgs

Re: [tesseract-ocr] Replacing contrast-enhanced image in PDF with low-contrast original , post-Tesseract

2019-09-10 Thread Timothy Snyder
Functionally that checks out to me. Not sure how you would get the unprocessed image into the pdf though. On Tue, Sep 10, 2019 at 11:47 AM IGM wrote: > I'm OCRing an old catalog with Tesseract (to make a searchable PDF), which > works fine except Tess has a hard time with low-contrast pages like

Re: [tesseract-ocr] Getting started with tesseract-ocr in a web app.

2019-09-13 Thread Timothy Snyder
All your web server has to do is facilitate command line calls to the Tesseract installation on your web server. The web server part is totally independent from Tesseract and as such, I think it exceeds the scope of this forum. Are you comfortable with developing client-server web applications? On

Re: [tesseract-ocr] Ideal config settings for finetuned monospace text?

2019-09-13 Thread Timothy Snyder
Have you tried using PSM 13? On my dataset it gives a few percent higher accuracy than PSM 6. Also, what kind of image preprocessing are you doing? I've reclaimed a ton of accuracy by fine-tuning my preprocessing. Mind posting some pictures of what you're recognizing? On Fri, Sep 13, 2019 at 2:00 AM Dustin Spicuzza wrote

Re: [tesseract-ocr] Getting started with tesseract-ocr in a web app.

2019-09-13 Thread Timothy Snyder
Perfect. All you have to do is develop services on your server to receive images and send back OCR text. With whatever scripting language you are using on your server, just make a programmatic command line call to Tesseract with the uploaded image and send that text back to the user however you wan
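
The server-side step described above could look roughly like this in Python. It is a sketch under assumptions: the function name `ocr_upload` is hypothetical, the web framework that receives the upload is left out, and `tesseract` is assumed to be installed on the server and reachable on the PATH.

```python
import os
import subprocess
import tempfile


def ocr_upload(image_bytes, suffix=".png", run=subprocess.run):
    """Save uploaded image bytes to a temp file, run tesseract, return the text.

    `run` defaults to subprocess.run; it is a parameter only so the function
    can be exercised without a Tesseract installation.
    """
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(image_bytes)
        result = run(
            ["tesseract", path, "stdout"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout
    finally:
        # Always remove the temporary copy of the upload.
        os.remove(path)
```

A request handler would call `ocr_upload(request_body)` and send the returned string back to the client.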

Re: [tesseract-ocr] Getting started with tesseract-ocr in a web app.

2019-09-16 Thread Timothy Snyder
Have you tried calling the tesseract executable from the command line yet? Can we confirm that you've successfully downloaded and compiled Tesseract? On Monday, September 16, 2019 at 5:13:20 PM UTC-4, Clint William Theron wrote: > > com'on guys, you might think this should be easy for me but it'

Re: [tesseract-ocr] Getting started with tesseract-ocr in a web app.

2019-09-16 Thread Timothy Snyder
If you downloaded Tesseract's source code from GitHub (which I think you did), you will have to follow the compilation steps for Linux on this page https://github.com/tesseract-ocr/tesseract/wiki/Compiling#linux On Mon, Sep 16, 2019 at 5:48 PM Clint William Theron < theronclintwill...@gmail.com>

Re: [tesseract-ocr] problems with upper-case character

2019-09-18 Thread Timothy Snyder
No configs I know of but I have similar functionality implemented in a text post-processing step in my OCR pipeline. On Wed, Sep 18, 2019 at 11:19 AM 'Sandra M.' via tesseract-ocr < tesseract-ocr@googlegroups.com> wrote: > I'm using Tesseract with Python. I have an image with 1-6 words in it and

Re: [tesseract-ocr] Handwritten traineddata.

2019-09-23 Thread Timothy Snyder
There is no out-of-the-box handwriting support. It is possible to train Tesseract with any image + boxfile, so if you can find labelled handwriting images online, you can try it out. On Sun, Sep 22, 2019 at 1:11 PM Ajinkya Khalwadekar < ajinkya.khalwade...@gmail.com> wrote: > Hi, > > Do we have tr

Re: [tesseract-ocr] Preprocessing Tools

2019-10-03 Thread Timothy Snyder
You can use free applications like paint.net or GIMP for single image processing or code your own pipeline with OpenCV in Python or C++ On Thu, Oct 3, 2019 at 4:36 AM Jennil Thiyam wrote: > HI shree, Is there any tools associated with tesseract that we can use for > preprocessing the images? Ple

Re: [tesseract-ocr] Can't detect text that underlined with dotted line (see picture)

2019-10-11 Thread Timothy Snyder
Try PSM 13. We use it and we often have artifacts similar to yours in our images. On Thu, Sep 26, 2019 at 10:29 AM Maya Paluy wrote: > Tesseract can't detect this text with default options. What tesseract > options or image preprocessing may help me? > > -- > You received this message because yo

Re: [tesseract-ocr] Tesseract ocr failed to recognize number from number plate images

2019-10-22 Thread Timothy Snyder
Yes you're going to have to do a significant amount of image processing to transform those license plates into straight black text on a white background. Have you tried out the OpenALPR project? On Tue, Oct 22, 2019 at 4:00 AM Sangharsh Kamble wrote: > [image: 2.jpeg] > > [image: 4.jpeg] > > [im

Re: [tesseract-ocr] OCR results are different on different OS (Linux and Windows)

2019-10-23 Thread Timothy Snyder
Can you create an image similar to yours but without the confidential information? On Wed, Oct 23, 2019 at 7:04 AM Yu Wang wrote: > We use the same version on both Mac OS and Ubuntu. Unfortunately, the > image contains confidential information that can not be shared publicly. > > On Wed, Oct 23, 2019 at 3:10

Re: [tesseract-ocr] Low DPI means game over?

2019-10-28 Thread Timothy Snyder
Which part are you trying to OCR? There's a lot of non-text likely interfering with recognition. On Mon, Oct 28, 2019 at 1:06 PM Abs wrote: > I'm struggling to get the square footage of the attached floor plan image. > > It partially works. Tesseract returns "1474 SQ" but I am hoping for the > f

Re: [tesseract-ocr] i tried Tesseract training for handwritten mathmatical expression recognition but trained data having 100% error rate

2019-12-18 Thread Timothy Snyder
Could you provide sample images from the training and testing set? I haven't tried training Tesseract with single characters at a time but you might want to try training on whole expressions like x+y=0. On Wed, Dec 18, 2019, 11:39 PM Haris Sheikh wrote: > hi i'm using Linux (ubuntu), > i tried t

Re: [tesseract-ocr] i tried Tesseract training for handwritten mathmatical expression recognition but trained data having 100% error rate

2019-12-18 Thread Timothy Snyder
Also, what sort of results are you getting if you recognize one character at a time instead of an entire expression? On Wed, Dec 18, 2019, 11:45 PM Timothy Snyder wrote: > Could you provide sample images from the training and testing set? I > haven't tried training Tesseract