from:"Tom Morris"

Re: Does anyone reading these posts below ???

2010-03-07 Thread Tom Morris

On Mar 5, 1:12 am, rthomas remi.tho...@gmail.com wrote: Hi, I think this project is dead. - Many people come here to get an Open Source OCR, do not contribute and go. - The few contributors can't commit the code because the project is locked. - Tesseract needs a complete rewriting. -

Re: Forking tesseract.

2010-05-13 Thread Tom Morris

On Apr 12, 10:31 pm, MARTIN Pierre hicksc...@gmail.com wrote: i've been making some very small changes to the Tesseract project. Most of them are only related to the way things are organized (Visual studio project, etc). However, i plan to be attacking the big deal very soon... In this

Re: Image pre-processing for good OCR results

2011-02-22 Thread Tom Morris

On Feb 20, 9:02 pm, Jon Andersen jande...@gmail.com wrote: My project at http://RecordAGrave.com is about recording headstones from graves and posting the text and images on the Net so that people can research their family history. I would appreciate some advice on how to pre-process these

Re: What are the real requirements for training?

2012-05-18 Thread Tom Morris

Thanks for reporting on your findings. On Thursday, May 17, 2012 5:50:36 AM UTC-4, Galt wrote: I have used this simple training text and the output is highly accurate. ... I ran this on a 75 page book and virtually every page was perfect, even though the scans themselves were not.

Re: Thoughts on having the training process take font files directly

2012-10-11 Thread Tom Morris

On Wednesday, October 10, 2012 12:59:49 PM UTC-4, Nick White wrote: So I've been tossing an idea around in my head for a while now, and I think it deserves discussion. As I understand it, the box/tif steps basically reduce varying character shapes to basic simplifications, for each font,

Re: Thoughts on having the training process take font files directly

2012-10-12 Thread Tom Morris

On Friday, October 12, 2012 8:50:25 AM UTC-4, Nick White wrote: Hi Tom, thanks for your thoughts. A key reason for not using scans when training is when the character set is quite large, so it would take many pages of real scans to get a few samples of each. Plus I found the process of

Re: one... pixel... difference

2012-10-25 Thread Tom Morris

On Wednesday, October 24, 2012 11:37:18 PM UTC-4, Phlip wrote: Tesseractors: We are using Tesseract for an outside-of-the-box situation - not scanning neatly typed documents. Our situation is a fuzzy, low-contrast picture. But - even when I use many image enhancements, such as leveling

Re: Bank Card Embossing Characters Recongnition

2012-11-22 Thread Tom Morris

Embossed cards are designed to be printed. Have you considered taking an impression and scanning the impression? Or just scanning (magnetically) the magnetic strip on the card? There have been other discussions of training Tesseract for OCR-A (and I think OCR-B). Farrington 7B is another in

Re: Deleting 'comments' from wiki pages

2012-11-30 Thread Tom Morris

Nick, thanks for going through and reviewing the comments for useful info. On Thursday, November 29, 2012 1:47:20 PM UTC-5, Nick White wrote: On Thu, Nov 29, 2012 at 07:16:05PM +0100, zdenko podobny wrote: What about to block posting comments on wiki (after its deletion)? It is more

Re: tesseract testing suite

2013-02-25 Thread Tom Morris

On Sunday, February 24, 2013 11:53:52 AM UTC-5, zdenop wrote: On Sun, Feb 24, 2013 at 12:20 AM, Nick White nick@durham.ac.uk wrote: On Fri, Feb 22, 2013 at 03:20:49PM +, Nick White wrote: On Sun, Jun 03, 2012 at 10:27:23PM +0100, zdenko podobny wrote: it looks like

Re: How to instruct tesseract not to use ligatures (i.e. don't use ﬁ, ﬂ... instead fi, fl...)

2013-04-30 Thread Tom Morris

On Monday, April 29, 2013 8:39:57 PM UTC-4, Michael Sander wrote: Yes, I'm doing something similar in python. Do you know of a list of a ligatures so I can convert them to ascii? I know fi and fl are the most popular, but there are probably many more. The list of Unicode ligatures is here:
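For anyone doing this conversion in Python, as Michael describes, here is a minimal sketch. It assumes the ligatures of interest are the Latin ones in the Unicode Alphabetic Presentation Forms block (U+FB00 to U+FB06) and relies on compatibility decomposition from the standard `unicodedata` module. The function name `expand_ligatures` is made up for illustration and is not part of Tesseract.

```python
import unicodedata

# Latin ligatures in the Alphabetic Presentation Forms block (assumed range).
LIGATURE_RANGE = range(0xFB00, 0xFB07)


def expand_ligatures(text):
    """Replace Latin ligature code points with their plain-letter expansion."""
    out = []
    for ch in text:
        if ord(ch) in LIGATURE_RANGE:
            # NFKD applies the compatibility decomposition, e.g. U+FB01 -> "fi".
            out.append(unicodedata.normalize("NFKD", ch))
        else:
            # Everything else (accents, symbols) is left untouched.
            out.append(ch)
    return "".join(out)


print(expand_ligatures("\ufb01nd the \ufb02ow"))
```

Restricting the decomposition to the ligature range, rather than normalizing the whole string, avoids rewriting unrelated characters such as superscripts or full-width forms.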

Re: Tesseract Android App

2013-05-11 Thread Tom Morris

On Friday, May 10, 2013 7:45:13 AM UTC-4, Renard Wellnitz wrote: Regarding publishing the code. I want to do this. I am not yet sure if I want to use the GPL or Apache 2.0. Also since I modified tesseract and leptonica sources a bit and have several submodules I need to write a proper

Re: best way to train.... or preprocess?

2013-05-17 Thread Tom Morris

On Thursday, May 16, 2013 10:52:51 PM UTC-4, Mike Masinick wrote: So, I have several hundred thousand scans of sports cards that look similar to the attached. I want to scan the text at the top of the page and extract at least the 8 digit number. Ideally more of the text as well, but the

Re: How to include release number when building tesseract project with Visual Studio 2008?

2013-07-02 Thread Tom Morris

On Tue, Jul 2, 2013 at 12:50 PM, zdenko podobny zde...@gmail.com wrote: Hi, I thought about increasing version number in svn to be able to distinguish devel version from released version, but I was not sure about version number. Up to this time it was Ray Smith who changed it after

Re: How to trian tesseract for new fonts?

2013-07-12 Thread Tom Morris

On Friday, July 12, 2013 8:24:12 AM UTC-4, sdk wrote: I have registered for Prima Tools. However, since I am not affiliated to any institution, I am not sure whether they will approve registration. I haven't heard back yet. That's an interesting racket that the University of Salford has

Re: Can somebody send me a copy of Ray Smith's PhD thesis?

2013-08-10 Thread Tom Morris

On Wednesday, July 31, 2013 3:20:43 AM UTC-4, billsmi...@gmail.com wrote: I am doing some research on tesseract-ocr, the most famous OCR software in the world. However, I am stuck in some code(for example, the feature extraction algorithms in the function ExtractIntFeat). I took long time

Re: Output of tesseract is not as useful without font baseline information?

2013-09-12 Thread Tom Morris

On Wednesday, September 11, 2013 11:14:22 AM UTC-4, ch...@sc3.net wrote: What I'm struggling with, is that the position information that Tesseract gives me doesn't seem to allow me to position the characters for display or for creating a stream to insert back into my PDF file. If I want

Re: Different Results on Linux vs Windows

2013-11-06 Thread Tom Morris

On Tuesday, November 5, 2013 12:38:57 PM UTC-5, rkomar wrote: So, I would say that totally different results with code compiled with different options or different compilers is a sign of bad code. I don't think it should just be shrugged off as unavoidable. I agree with Rob here.

Re: Tesseract 3.0 Performance down at 32bit Os

2013-11-07 Thread Tom Morris

The discussion of vectorization vs parallelization is interesting, but doesn't have anything to do with the question that was asked. Perhaps a new thread is in order? Niral - An order of magnitude difference in performance is unlikely to be attributable to 32-bit vs 64-bit. You left out all

Re: Remove text of a specific font type from image

2013-11-15 Thread Tom Morris

On Tuesday, November 12, 2013 12:18:59 PM UTC-5, Vishalkpp wrote: Sorry for missing the attachment. Please find attached the image. The numbers around the 2D diagram are the annotations I would like to address. It's hard to extrapolate from a single example, but this feels more like a

Re: Franken+ Released -- New Tool For Training Tesseract on Fonts from Page Images

2013-12-07 Thread Tom Morris

It's great to have another open source tool in the toolbag. (It's GPL v3 BTW for those who don't appreciate the irony of distributing a GPL license in proprietary Microsoft Word format.) I'll echo what the others have said about openness and freedom, or lack thereof. Not only is the Aletheia

Re: Processing speed difference between Windows and Solaris

2013-12-07 Thread Tom Morris

On Friday, December 6, 2013 5:46:39 AM UTC-5, Casper Kent wrote: I am trying to use Tesseract to process coupon images. The recognition result is great. However the processing speed varies hugely between my Windows test environment and the Sparc/Solaris production environment. A batch

Training Tesseract for early printed text

2013-12-07 Thread Tom Morris

In watching Bryan Tarpley's Franken+ presentation (http://emop.tamu.edu/node/54) it's pretty obvious from the example that there are (at least) two clusters of glyphs for the letter 'o': a tall skinny glyph and a round glyph.

Re: Franken+ Released -- New Tool For Training Tesseract on Fonts from Page Images

2013-12-07 Thread Tom Morris

p.s. You probably want to add a .gitignore file so that you aren't committing binaries to the repository. Also, it seems like Franken+, as a standalone tool, really could use its own repo. They're lightweight and free and that would give you a separate bug tracker, wiki, etc, as well as the

Re: Proposed new page for the wiki: PoorQuality

2013-12-14 Thread Tom Morris

On Friday, December 13, 2013 11:25:42 AM UTC-5, Nick White wrote: I've drafted such a page, and I'd be keen to get feedback on it. Is it clear? Is it a good idea? I haven't filled out all of the Image processing sections yet, but (presuming people don't hate the idea in general) I will do

Re: Individual character variation lists

2014-03-18 Thread Tom Morris

On Wednesday, March 12, 2014 7:57:38 AM UTC-4, John Green wrote: *What I'm doing: *As part of a longer pipeline, at one step I am reasoning over very small but highly characteristic strings like drug dosage, 60 mg. Edit distance (Levenshtein or a variation) and n-grams, even unigrams,
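For readers unfamiliar with the edit distance John mentions, here is a minimal sketch of the classic Levenshtein dynamic-programming algorithm. It is only an illustration of the metric itself, not of John's pipeline, and the example strings in the test are assumptions chosen to resemble typical OCR confusions.

```python
def levenshtein(a, b):
    """Edit distance counting single-character inserts, deletes and substitutions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


print(levenshtein("60 mg", "6O mg"))
```

On strings this short a single confusion (digit 0 read as letter O) already accounts for a large share of the string, which is one reason plain edit distance can be a blunt instrument for such data.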

[tesseract-ocr] Re: Advice needed on effective hexadecimal recognition

2014-06-28 Thread Tom Morris

p.s. On Saturday, June 28, 2014 12:39:21 AM UTC-4, scott...@gmail.com wrote: 3) Attempted to increase the strength of dictionary matches as discussed on the FAQ ( https://code.google.com/p/tesseract-ocr/wiki/FAQ#How_to_increase_the_trust_in/strength_of_the_dictionary?), both via API

[tesseract-ocr] Re: [tesseract-dev] Re: tesseract 3.04 can be downloaded as a package for msys2 (will work on windows)

2014-08-28 Thread Tom Morris

On Wed, Aug 27, 2014 at 4:51 PM, zdenko podobny zde...@gmail.com wrote: there is no git command for it (well maybe we could track down the revision number and tag it, but...) Isn't tagging releases good software engineering practice regardless of all this other discussion? Looks to me like

[tesseract-ocr] Re: Problem with Binarization

2014-09-18 Thread Tom Morris

On Tuesday, September 16, 2014 9:11:35 PM UTC-4, Albrecht Hilker wrote: Tesseract requires text to be 300 DPI. Resize your images with factor 3 and try again. The built-in thresholder should be enough for your samples. (No Gimp required) Or better yet, rescan at the higher resolution so
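The factor of 3 in the quoted advice presumably corresponds to a scan of roughly 100 DPI. As a rough sketch of that arithmetic (the function names and the 640x480 example are assumptions, and upscaling does not recover detail the way a rescan does):

```python
def scale_factor(current_dpi, target_dpi=300):
    """Factor by which an image must be resized to reach the target DPI."""
    if current_dpi <= 0:
        raise ValueError("current_dpi must be positive")
    return target_dpi / current_dpi


def scaled_size(width, height, current_dpi, target_dpi=300):
    """Pixel dimensions after resizing to the target DPI."""
    factor = scale_factor(current_dpi, target_dpi)
    return round(width * factor), round(height * factor)


print(scale_factor(100))           # a 100 DPI scan needs a factor of 3
print(scaled_size(640, 480, 100))  # resulting pixel dimensions
```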

[tesseract-ocr] Re: Q: OCR on document with registerable marks?

2014-10-20 Thread Tom Morris

On Thursday, October 16, 2014 11:04:58 AM UTC-4, Zunair Fayaz wrote: need best practice to OCR on documents with + sign that helps align the documents. Those are typically referred to as registration marks. Any known practice? I would have thought they'd be pretty easy to detect using

[tesseract-ocr] Re: Tesseract for recognition the international phonetic transcription

2015-04-23 Thread Tom Morris

Has anyone tackled training for the IPA since this initial query? I'm considering using Tesseract to OCR the first edition of the Oxford English Dictionary (as input to a crowdsourced proofing process) and trying to decide whether it's worth training it to recognize the pronunciations. I'm

Re: [tesseract-ocr] Re: Multiple tifs to one file

2015-04-23 Thread Tom Morris

On Thursday, April 23, 2015 at 9:03:02 AM UTC-4, Stathis L. wrote: Point is it's a 800-900 page tiff that I want to create and these solutions don't really work, as making a list is not a question and VietOCR runs out of memory. Anything else? Are you replying to Quan or Zdenko? It helps

[tesseract-ocr] Re: Is there any way to speed up extraction using tesseract OCR Engine, while tiff file is having 600-700 pages?

2015-04-19 Thread Tom Morris

On Sunday, April 19, 2015 at 8:23:20 AM UTC-4, James Worldprogram wrote: During processing of tiff files, which are having *600 - 700 pages* from Tesseract OCR engine with hocr option, we monitored that files are taking around *40 - 50 minutes*. So, 14-15 pages/minute or 4-5 seconds/page.
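The back-of-the-envelope conversion in the reply can be reproduced as follows. This is just the arithmetic on the figures James quoted; the function name is made up for illustration.

```python
def throughput(pages, minutes):
    """Return (pages per minute, seconds per page) for a batch run."""
    pages_per_minute = pages / minutes
    seconds_per_page = 60 * minutes / pages
    return pages_per_minute, seconds_per_page


# Figures quoted in the thread: 600-700 pages in 40-50 minutes.
for pages, minutes in [(600, 40), (700, 50)]:
    ppm, spp = throughput(pages, minutes)
    print(f"{pages} pages in {minutes} min: {ppm:.1f} pages/min, {spp:.1f} s/page")
```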

[tesseract-ocr] Re: OCR technology referênces?

2015-04-28 Thread Tom Morris

On Tuesday, April 28, 2015 at 2:17:40 AM UTC-4, Lenilson Castro wrote: I'm beginning an academic work about OCR technology on mobile devices and i was looking for some fonts for a bibliographic revision about that. Does anybody know where do i can find good referênces about OCRs history

Re: [tesseract-ocr] Re: Tesseract for recognition the international phonetic transcription

2015-04-30 Thread Tom Morris

all non-ascii characters) in the digital 2nd edition OED, which is probably a very good starting point for generating training texts. Nick On Thu, Apr 23, 2015 at 01:32:12PM -0700, Tom Morris wrote: Has anyone tackled training for the IPA since this initial query? I'm considering using

[tesseract-ocr] Re: Extracting molecular labels from biological pathway images

2015-05-06 Thread Tom Morris

applicable. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2732221/ http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3265968/pdf/nihms241943.pdf http://cda.ornl.gov/publications_2011/Publication%2028596_S.%20Xu.pdf On Wednesday, May 6, 2015 at 2:41:13 PM UTC-4, Tom Morris wrote: You might consider looking

[tesseract-ocr] Re: Extracting molecular labels from biological pathway images

2015-05-06 Thread Tom Morris

You might consider looking at some of the papers on text detection in natural images and using the techniques from the later stages in the pipeline. These are similar to what Dmitri outlined, but reviewing what others have done might give you ideas on additional ways to filter and group

[tesseract-ocr] Re: What pre-processing has Tesseract already included?

2015-05-14 Thread Tom Morris

On Thursday, May 14, 2015 at 12:54:26 AM UTC-4, smwikipedia smwikipedia wrote: I am trying to add some pre-processing post-processing for my scenario. I read some papers and it seems the steps could be: - noise attenuation - orientation correction - binarization -

[tesseract-ocr] Re: tesseract export into txt vs. into pdf (issues with some characters)

2015-05-14 Thread Tom Morris

On Friday, March 13, 2015 at 8:29:13 AM UTC-4, Jan wrote: I noticed that when I use tesseract to create a searchable pdf (I use pdfsandwich for this), some characters are not displayed and are replaced by blank spaces instead. If I, however, ocr the same file with tesseract only in order

[tesseract-ocr] Re: Images and text

2015-05-18 Thread Tom Morris

While some people think OCRopus has a more sophisticated page layout analysis, it's not true that Tesseract does no layout analysis. The man page describes the different types of segmentation that it does: https://tesseract-ocr.googlecode.com/git/doc/tesseract.1.html Tom On Monday, May 18,

[tesseract-ocr] Re: Math / equation detection module for Tesseract 3.02

2015-05-15 Thread Tom Morris

A few notes on equation detection: - In the 3.02 announcement https://groups.google.com/forum/#!searchin/tesseract-ocr/equation/tesseract-ocr/EXyGqT9osrw/y1xMAujqZy4J, the feature is listed as *experimental* equation detector (emphasis added) - there's no documentation on what's actually in

[tesseract-ocr] Re: Math / equation detection module for Tesseract 3.02

2015-05-15 Thread Tom Morris

p.s. For anyone interested in the topic, here's a master's thesis on math equation detection and segmentation (and the work was done using Tesseract): https://vtechworks.lib.vt.edu/handle/10919/46724 https://vtechworks.lib.vt.edu/bitstream/handle/10919/46724/Bruce_JR_T_2014.pdf?sequence=1

[tesseract-ocr] Re: Improve OCR accuracy

2015-06-22 Thread Tom Morris

On Monday, June 22, 2015 at 7:56:51 AM UTC-4, Gunasekaran Velu wrote: I have attached the image as well as Tesseract OCR result for attached image screen shot. the below OCR some words are missing from OCR how can i improve the image quality to detect the missing words. The attached

Re: [tesseract-ocr] Inuktitut OCR problems - ᐃᓄᑦᑎᑐᑦ (Euphemia typeface)

2015-06-24 Thread Tom Morris

In addition to Art's training data, you might also want to test the IKU language data for Tesseract 3.04 that Google released a few hours ago: https://github.com/tesseract-ocr/tessdata/blob/master/iku.traineddata It was generated from the source language data here:

[tesseract-ocr] Re: Selecting OpenCL device?

2015-06-26 Thread Tom Morris

On Thursday, June 25, 2015 at 4:23:32 PM UTC-4, subo...@gmail.com wrote: I've been playing around with the OpenCL option for Tesseract. It appears on very first runs, it profiles the system and tries to determine which is the fastest compute device and uses that device for future runs. Is

Re: [tesseract-ocr] Inuktitut OCR problems - ᐃᓄᑦᑎᑐᑦ (Euphemia typeface)

2015-06-24 Thread Tom Morris

That's cool that there's already a starting point for the IKU language training. To help you understand the various files in Art's repo and the process used to create them, here's the wiki page which describes the training process:

[tesseract-ocr] Re: poor recognition of 'fi'

2015-06-16 Thread Tom Morris

It's difficult to tell what the problem is without any example images. Are you saying that there are ligatures in the image and you don't want them recognized as such or that there are not ligatures, but the characters are touching due to low resolution or poor quality scan or over inking or

[tesseract-ocr] Re: hOCR editor

2015-05-27 Thread Tom Morris

On Tuesday, May 26, 2015 at 12:17:32 AM UTC-7, Stathis L. wrote: Does anybody know of a software editing the hOCR output of tesseract (besides a simple text editor) ? This Firefox add-on might do what you want: https://addons.mozilla.org/en-us/firefox/addon/hocr-editor/ Tom

[tesseract-ocr] Re: jbig2 encoding in PDF output file

2015-07-01 Thread Tom Morris

On Monday, June 29, 2015 at 3:57:08 AM UTC-4, Jeff Breidenbach wrote: Not available currently, and pretty major effort required to make it happen, both in Leptonica and Tesseract's PDF output module. No plans to work on this. For other formats we try hard to not re-encode during PDF

[tesseract-ocr] Corpus for word frequencies in eng.cube.word-freq ?

2015-07-01 Thread Tom Morris

When I look at the word frequencies in eng.cube.word-freq, they look more like what I would expect from analyzing a web corpus rather than a corpus of printed materials (of any era). The list starts off okay: #1 the 13675 #2 of 15222 #3 and 15473 #4 to 15694 #5 a 17149 but then we have: #29

Re: [tesseract-ocr] Tesseract returns empty result with custom language but not english

2015-07-06 Thread Tom Morris

Be sure to check https://github.com/tesseract-ocr/langdata before assuming that the language that you need isn't supported. Dozens of new languages were added a couple of weeks ago. Tom On Monday, July 6, 2015 at 9:09:06 AM UTC-4, Brennan Nunamaker wrote: For clarification: With text, I

[tesseract-ocr] Re: Extract Rotated/Tilted Text from Scanned Image

2015-07-31 Thread Tom Morris

On Friday, July 31, 2015 at 3:26:16 AM UTC-4, Merv wrote: The task I am facing is extraction of rotated text (any angle) from a scanned image. Kindly find the link : http://1drv.ms/1OS8elW which has the sample PDF document that I need to OCR. The sample contains a blue sticker on it and I

[tesseract-ocr] Re: is hOCR the best route to convert a large number of repetitive forms into structured data?

2015-07-14 Thread Tom Morris

On Tuesday, July 14, 2015 at 2:47:40 AM UTC-4, James Owers wrote: You should consider also using the PAGE format. You can use this tool for conversion: http://www.primaresearch.org/tools/TesseractOCRToPAGE Most PAGE format tools aren't available as open source and use a custom license

[tesseract-ocr] Re: Is this too ambitious?

2015-07-16 Thread Tom Morris

Your goals don't sound unreasonable, but I'd suggest using an approach that focuses on pre and post processing before diving in and hacking on tesseract itself. That will allow you to easily continue to track improvements in base tesseract without having to worry about re-integrating your

[tesseract-ocr] Re: Maximum box/window/font size

2015-07-16 Thread Tom Morris

On Monday, July 13, 2015 at 1:23:10 AM UTC-4, Elan wrote: I am fairly new to tesseract, I have done some playing around with training new fonts, and loading config files etc. I have an issue with the images I am trying to OCR. In many cases, there is a dotted horizontal line about 5-10

[tesseract-ocr] Re: Tesseract Reading columns

2015-07-16 Thread Tom Morris

On Thursday, July 16, 2015 at 1:07:11 AM UTC-4, Robertas Kaunas wrote: Hello, everybody. I am trying to read two columns between whose are big gap. So which setting should I modify to let tesseract reading word from second column as in one line with word in first column? Look at the page

Re: [tesseract-ocr] Re: jbig2 encoding in PDF output file

2015-07-17 Thread Tom Morris

Thanks for the analysis and feedback, Jeff. Unfortunately, I don't know much about QPDF (and SourceForge's storage problems are preventing me from learning any more), but doing #3 externally using a tool like QPDF, perhaps in conjunction with doing #1 in Tesseract itself, sound like reasonable

Re: [tesseract-ocr] "Empty Page" and incomplete text recognition

2015-10-28 Thread Tom Morris

On Tuesday, October 27, 2015 at 4:49:11 PM UTC-4, Daniel Kraft wrote: On 2015-10-27 16:10, Allistair wrote: I was able to get it reading everything by cropping it to the same amount as Working but then rotating it anti clockwise by just a few degrees - I tried this because

[tesseract-ocr] Re: How to extract bounding box only? If I do not need the word/characters classifier.

2015-10-28 Thread Tom Morris

On Wednesday, October 28, 2015 at 4:46:14 AM UTC-4, jinh...@google.com wrote: First, I have very little knowledge about ocr/tesseract. ... Please help. If only you worked for Google, you could probably get help directly from the Google software engineers. Oh, wait. You DO work

Re: [tesseract-ocr] how to use tesstrain .sh etc in ubuntu 15.10

2015-11-12 Thread Tom Morris

On Thursday, November 12, 2015 at 8:51:06 AM UTC-5, sriranga(82yrsold) wrote: Alternatively, kindly forward copy of your commandlines used for *grctraining* - since http://ancientgreekocr.org/grctraining.git

Re: [tesseract-ocr] Is there any difference using Tesseract on a mac or pc ?

2015-10-15 Thread Tom Morris

What happens when you use the same command line flags on both machines? You show no flags being used on the Mac. What version of Tesseract are you using? Did you build the binaries yourself (what compilers?) or did you get them from somewhere (where?)? What image are you trying to recognize?

[tesseract-ocr] Re: obtaining pre-processed image

2015-10-18 Thread Tom Morris

On Sunday, October 18, 2015 at 2:00:39 AM UTC-4, Mayu Shukla wrote: But i am not able to understand where to look for and which config file in config folder, since there are many. which one tesseract calls and how it works? The config file is the one you specify on the command line.

[tesseract-ocr] Re: Suggestions on running PDFs through Tesseract without losing vector graphics?

2015-09-10 Thread Tom Morris

On Thursday, September 10, 2015 at 2:31:03 AM UTC-4, hmmwhat...@gmail.com wrote: On Friday, September 4, 2015 at 9:38:20 PM UTC-7, Jeff Breidenbach wrote: But I would like to see an example PDF - one of the simpler ones - just to see how the vector graphics were done. Please do not

[tesseract-ocr] Re: Recognition affected by blank space

2015-09-10 Thread Tom Morris

Are you doing any pre-processing besides cropping? If those images are representative and the colors are constant, I'd replace the orange background with black and then invert the image to give black digits with no border on a white background. Also use the page segmentation mode for a
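The background replacement and inversion suggested above can be sketched as follows. To stay dependency-free the image is represented as nested lists of RGB tuples; in practice an image library would be used. The reference orange value, the tolerance, and the function names are all assumptions, since the original images are not shown here.

```python
# Assumed reference colour for the orange background; tune to the real images.
ORANGE = (255, 165, 0)


def is_background(rgb, reference=ORANGE, tolerance=60):
    """True if every channel is within `tolerance` of the reference colour."""
    return all(abs(c - r) <= tolerance for c, r in zip(rgb, reference))


def black_on_white(pixels):
    """Replace the background with black, then invert every pixel.

    Background pixels end up white; light digits end up dark.
    """
    result = []
    for row in pixels:
        new_row = []
        for rgb in row:
            if is_background(rgb):
                rgb = (0, 0, 0)
            new_row.append(tuple(255 - c for c in rgb))
        result.append(new_row)
    return result


sample = [[(255, 165, 0), (255, 255, 255)]]  # one orange pixel, one white pixel
print(black_on_white(sample))
```

The sketch assumes light digits on the orange background; if the digits are dark, only the background replacement (to white) would be needed and the inversion should be dropped.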

[tesseract-ocr] Re: Use Tesseract to capture digits in a picture

2015-09-10 Thread Tom Morris

July 2015 22:55:25 UTC+5:30, Tom Morris wrote: I'd suggest looking at OpenCV. It looks more like a computer vision task than an OCR task. Some of the specific issues like dials not fully aligned in the window are things the OCR systems aren't designed to

[tesseract-ocr] Re: tesseract-ocr does not very well on chinese

2015-09-10 Thread Tom Morris

On Wednesday, September 9, 2015 at 1:30:38 PM UTC-4, Max Heiber wrote: Here's an example where the Chinese characters are very large and clear, but Tesseract gets the wrong result. Could you advise on what image processing could help Tesseract's accuracy? What have you tried so far?

Re: [tesseract-ocr] Problems with make script with of head version on a Synology system.

2015-09-11 Thread Tom Morris

On Friday, September 11, 2015 at 3:29:45 AM UTC-4, worldwar...@googlemail.com wrote: Thank you for the fix in GitHub! You're welcome! Hopefully the other hints will help you report/fix the next bug you find. Tom

Re: [tesseract-ocr] Re: Suggestions on running PDFs through Tesseract without losing vector graphics?

2015-09-11 Thread Tom Morris

I don't know where all this complexity came from. PDF rasterizers have existed since the format was invented. GhostScript is one popular open source alternative. It could either be used directly or through a tool that embeds it such as ImageMagick. Tools like Apache PDFBox can be used to add

[tesseract-ocr] Re: user-words / bazaar

2015-09-30 Thread Tom Morris

Perhaps this is just a misunderstanding or bad documentation. The --print-parameters dump shows the input parameters, and the user_words_file / user_patterns_file parameters, if they're not set on the command line, will always be empty. The actual file name that gets loaded gets computed on

Re: [tesseract-ocr] Problems with make script with of head version on a Synology system.

2015-09-09 Thread Tom Morris

On Wednesday, September 9, 2015 at 9:09:39 AM UTC-4, worldwar...@googlemail.com wrote: I can confirm this bug! I tried to compile tesseract-3.04.00 on a QNAP TS119 NAS with a Marvell Kirkwood with many tools from optware/cs08q1armel/cross/unstable/ . I get exactly the same error

[tesseract-ocr] Re: user-words / bazaar

2015-09-24 Thread Tom Morris

On Monday, September 21, 2015 at 9:29:39 AM UTC-4, Stef wrote: I'm trying to use user wordlists with the bazaar config but it seems to have no effect on the OCR result in my case. Therefore I printed the current parameters to verify whether the user-words list is used. This confirmed

[tesseract-ocr] Re: no attribute 'TessBaseAPI'

2015-10-05 Thread Tom Morris

You might try pytesseract instead to see if you have better luck: https://pypi.python.org/pypi/pytesseract Tom On Friday, October 2, 2015 at 1:57:56 PM UTC-4, Chang Alden wrote: I have python-tesseract-0.9.1-py2.7 installed but I am getting this error. Traceback (most recent call last):

Re: [tesseract-ocr] Train tesseract 3.04 for recognition of six patterns no existents in UTF-8

2015-10-05 Thread Tom Morris

I think Dmitri's suggestion to start simple is a good one, but, if you need it, don't forget that you've got a lot of other information that can be leveraged to help. The notes all have a fixed aspect ratio (and size?). They've got a relatively standard layout. The denomination is encoded

[tesseract-ocr] Re: Two visually identical images - Tesseract finds text from one but not the other

2015-10-05 Thread Tom Morris

If you think those images are visually identical you should visit your optician. :-) The ImageMagick version is much blurrier, so I'd guess that the high frequency noise from the pixelated OpenCV image is making Tesseract unhappy. If you want to continue using OpenCV, try applying a Gaussian

Re: [tesseract-ocr] how to test the file kan.unicharambigs used for tesseeract-ocr 3.04 or 3.05

2015-12-08 Thread Tom Morris

cessor* being external program is - one should have update the post-proc.text everytime for each ocred I am puzzled why unicharambigs does not work as internal program correctly - when the post processor program works fine? With regards, sriranga(83yrs)

Re: [tesseract-ocr] how to test the file kan.unicharambigs used for tesseeract-ocr 3.04 or 3.05

2015-12-07 Thread Tom Morris

Hi Sriranga. I haven't used the training tools, but since no one else has answered, I'll give it my best attempt. Shree might have better insights. First, a question of clarification. Are you having problems with the file or are you just trying to determine whether it is working properly or

[tesseract-ocr] Re: Problem in TIFF/Box Generator in jTessBoxEditor

2015-12-02 Thread Tom Morris

On Wednesday, November 25, 2015 at 12:54:25 PM UTC-5, Nasim Ali wrote: Nguyen (program creator) says the problem is with java, so I've decided to use qtboxcreator to create boxes and the subsequent work is handled by jTessBoxEditor. To expand on this, it doesn't look like Java

[tesseract-ocr] Re: using Tesseract with Embarcadero RAD Studio 10 C++Builder

2015-12-10 Thread Tom Morris

I don't see what the IDE has to do with anything. https://github.com/tesseract-ocr/tesseract/wiki/APIExample On Wednesday, December 9, 2015 at 10:22:51 AM UTC-5, Matthias Schneider wrote: Hi all, I'm currently trying to get Tesseract working with Embarcadero RAD Studio 10 C++Builder.

Re: [tesseract-ocr] how to test the file kan.unicharambigs used for tesseeract-ocr 3.04 or 3.05

2015-12-10 Thread Tom Morris

xts that you're trying to correct. Tom your suggested sentence "Novv is the time to go dovvn" also corrected. Please note I regenerated eng.traineddata in ubuntu 15.10. With regards, sriranga(83ys) On Wed, Dec 9, 2015 at 12:04 AM, Tom Morris <t

[tesseract-ocr] Re: Tesseract seems to be removing correctly segmented and oriented blocks for the final classification

2015-12-22 Thread Tom Morris

On Tuesday, December 22, 2015 at 2:04:26 AM UTC-5, Utkarsh Sinha wrote: I'm trying to find out why Tesseract is rejecting certain blobs from the image here. The text "nestle" and "nesquik" have overlapping baselines. I suspect the overlap might be causing it to stop recognizing anything

Re: [tesseract-ocr] v3.04 Release???

2015-11-18 Thread Tom Morris

Although it wasn't tagged with a version, you'll also want the relevant language data files from https://github.com/tesseract-ocr/langdata. Note that this is a source release. There are binary releases available from some of the distribution packagers such as Debian:

[tesseract-ocr] Re: Products Expiration date recognition

2016-06-03 Thread Tom Morris

On Friday, June 3, 2016 at 10:09:04 AM UTC-4, Cristian wrote: I'm new on tesseract. I'm working on application that has to recognize the expiration date of some products like foods. The input will be an image (very good resolution) with only the date on it. Before putting my hand on

[tesseract-ocr] Re: Сommercial purposes

2016-06-02 Thread Tom Morris

On Thursday, June 2, 2016 at 6:40:55 AM UTC-4, Вадим Авлочинский wrote: Can I use the Tesseract OCR lib to develop my applications and use it for commercial purposes? The license is pretty liberal. Your lawyers can find it here:

[tesseract-ocr] Re: Unable to identify simple 6 digit numbers

2016-06-02 Thread Tom Morris

On Wednesday, June 1, 2016 at 9:14:21 AM UTC-4, Rob Shanks wrote: I am trying to use Tesseract to recognise a 6 digit number from scanned documents. Because they are scanned the numbers can be faded but I know that they are 6 digit. The scanned documents have been destroyed long ago

[tesseract-ocr] Re: Quality and delay

2016-06-02 Thread Tom Morris

On Thursday, June 2, 2016 at 2:38:32 AM UTC-4, Guilherme Galdino Siqueira wrote: > > > I have an issue with bad quality of recognition of text with high delay of > finishing the process. I've tried to binarize the source image, reduce its > size, change environment light to take the picture,

[tesseract-ocr] Re: invalid floating point operation when calling TessBaseAPIAnalyseLayout

2016-06-11 Thread Tom Morris

What version of Tesseract? Source or binary? If source, compiled with what compiler & compilation options? What operating system? What processor architecture? What's the stack trace for the crash? On Friday, June 10, 2016 at 6:00:19 AM UTC-4, Matthias Schneider wrote: > > Hi, > > I'm trying to

[tesseract-ocr] Re: Support Company for Tesseract

2016-06-14 Thread Tom Morris

Bojidar - Google isn't going to provide support for 3rd party use of a project that they've open sourced. Sumant - There are a number of developers on this list who've used Tesseract in their own applications or created custom OCR solutions based on it. One or more of them may be willing to

[tesseract-ocr] Re: invalid floating point operation when calling TessBaseAPIAnalyseLayout

2016-06-13 Thread Tom Morris

On Monday, June 13, 2016 at 10:22:32 AM UTC-4, Matthias Schneider wrote: > > I'm using latest dev version 3.05.00dev and I used peirick/leptonica ( > https://github.com/peirick/leptonica) to build libtesseract.dll and > liblept.dll with Visual Studio 2015. > However, the resulting DLLs I'm using

[tesseract-ocr] Re: How to get intermediate result of tesseract? Like processed image as output.

2016-06-24 Thread Tom Morris

Rather than rolling your own, I'd suggest looking at the ScrollView app to see if it has the information that you need: https://github.com/tesseract-ocr/tesseract/wiki/ViewerDebugging Tom On Wednesday, June 22, 2016 at 7:16:35 PM UTC-4, Quan Nguyen wrote: > > You'll need to programmatically

[tesseract-ocr] Re: Assistance with OCR on frames from screen capture

2016-06-24 Thread Tom Morris

I'd also get rid of the blue and make the text black on white. Presumably the colors are constant, so this should be an easy pre-processing step. Tom On Friday, June 24, 2016 at 4:01:42 AM UTC-4, Stef wrote: > > > You could try and scale up the image before OCR. See section "Scale text > up"

[tesseract-ocr] Re: PDF/A versions

2016-01-16 Thread Tom Morris

On Wednesday, January 13, 2016 at 3:52:48 PM UTC-5, John Scancella wrote: > > I tried searching but couldn't find which versions of PDF/A (if any) > tesseract supports. Specifically I have a requirement for PDF/A-2a > generation, but I couldn't find anywhere if tesseract can write PDF/A-2a >

[tesseract-ocr] Re: hocr's line baseline

2016-06-25 Thread Tom Morris

Hi Stef. Is that info hiding in the wiki somewhere? If not, do you think you could find a place to add it? Tom On Saturday, June 25, 2016 at 4:41:23 PM UTC-4, Stef wrote: > > The two numbers are the slope (1st number) and constant term (2nd number) > of a linear equation describing the

Re: [tesseract-ocr] Re: How to get intermediate result of tesseract? Like processed image as output.

2016-06-25 Thread Tom Morris

waiting for > server... waiting for server... it's like it goes in infinity loop. > can you suggest solution for it? > > > On Friday, June 24, 2016 at 10:35:27 PM UTC+5:30, Tom Morris wrote: >> >> Rather than rolling your own, I'd suggest looking at the ScrollView app

[tesseract-ocr] Re: [ask] unrecognized text in particular layout

2016-06-26 Thread Tom Morris

On Saturday, June 25, 2016 at 2:45:51 PM UTC-4, denny.m...@gmail.com wrote: > > > could anyone kindly explain why the text "YTOVWG" in ocr3.jpg is not > recognized > but can be recognized in ocr4.jpg ? > As a guess, I'd say because the mixed fonts on a single line and large gap before the

[tesseract-ocr] Re: How to help tesseract identify the character in this image?

2016-06-26 Thread Tom Morris

On Sunday, June 26, 2016 at 7:22:20 AM UTC-4, Arunabh Ghosh wrote: > > I have uploaded an image which has the character 'e' written in it. I had > preprocessed the original image using opencv2 but still couldn't get > tesseract to recognize the character. > > Any suggestions on how to get

[tesseract-ocr] Re: OCR text problem using -psm 6 hocr

2016-01-28 Thread Tom Morris

On Thursday, January 28, 2016 at 4:24:26 AM UTC-5, Gunasekaran Velu wrote: I am using following tesseract command to do the HOCR file for bmp image > > >tesseract.exe "test.bmp" Test -l eng -psm 6 hocr > > Marked area in the attached image does not come in to Test.hocr.html file. > except

[tesseract-ocr] Re: PDF/A versions

2016-01-29 Thread Tom Morris

I just stumbled across https://github.com/jbarlow83/OCRmyPDF which claims to use Tesseract and provide PDF/A support. Tom

[tesseract-ocr] Re: New fast implementation of Sauvola binarizer

2016-01-31 Thread Tom Morris

On Saturday, January 30, 2016 at 5:46:31 PM UTC-5, Ilya Mezhirov wrote: > > > I've written a binarizer that does Sauvola a lot faster than Leptonica. It > also can double the resolution. > Works strictly with JPEGs and outputs G4-compressed TIFFs. > > Check it out:

Re: [tesseract-ocr] Modyfying existing traineddata

2016-02-24 Thread Tom Morris

On Tuesday, February 23, 2016 at 5:40:33 PM UTC-5, Devon Yoo wrote: > > > Is there a way to give TesseractEngine a hint of expected text format? For > example, can I set a format like 00XXX00 XX-000 where 0 represents number > and X represents alphabet? > See the answer to this question:

[tesseract-ocr] Re: Please teach me What is improved ver. 3.04.

2016-02-25 Thread Tom Morris

, syr, tgk, tir, uig, urd, uzb, uzb_cyrl, yid There are a total of 107 languages supported now. On Thursday, February 25, 2016 at 3:41:00 PM UTC-5, Tom Morris wrote: > > On Thursday, February 25, 2016 at 3:32:10 AM UTC-5, 기옥주 wrote: >> >> I wonder what is improved ver. 3.04. more


1 - 100 of 227 matches
