[tesseract-ocr] Re: Detect Vertical words and remove from the images

2017-05-19 Thread Brian51
On Wednesday, May 17, 2017 at 10:43:12 PM UTC+2, akhil katpally wrote: > Hello group > I have a document image containing mostly horizontal text, but some text (a very small amount) is vertical, often with some of it overlapping the horizontal text. I am attaching the image,

Re: [tesseract-ocr] Training from scratch

2017-05-19 Thread ShreeDevi Kumar
Google has not shared its method of training with complete scripts etc. The training instructions on wiki are only a tutorial for learning about LSTM training. Please also see https://github.com/tesseract-ocr/tesseract/issues/644 ShreeDevi -- You received this message because you are

[tesseract-ocr] Extract font size,style,colour from an image

2017-05-19 Thread mandar bandodekar
Hi, is it possible to extract font colour, font style (bold, italic), and size using Tesseract-ocr?

[tesseract-ocr] Re: jcopy-paste an image provides different ocr results

2017-05-19 Thread Youcef
OK, I got it: I should copy-paste the uzn file too! On Thursday, May 18, 2017 at 13:38:30 UTC+2, Youcef wrote: > Hi everybody, > I'm simply copy-pasting an image in Linux like this: > cp 7703_062.3B.tif 7703_062_copy.3B.tif > and then running tesseract on each with exactly the same command line:
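The fix described above (copying the zone file together with the image) can be sketched as a small helper. This is only an illustration: the function name `copy_with_zone_file` is hypothetical, and it assumes the zone file sits next to the image and shares its name with the final extension replaced by `.uzn`.

```python
import shutil
from pathlib import Path


def copy_with_zone_file(src, dst):
    """Copy an image and, if one exists, its companion .uzn zone file.

    Assumption: the zone file is named like the image, with the last
    extension swapped for ".uzn" (e.g. 7703_062.3B.tif -> 7703_062.3B.uzn).
    Returns the path of the copied zone file, or None if there was none.
    """
    src, dst = Path(src), Path(dst)
    shutil.copy2(src, dst)

    zone_src = src.with_suffix(".uzn")
    if not zone_src.exists():
        return None

    # Give the copied zone file the same base name as the copied image.
    zone_dst = dst.with_suffix(".uzn")
    shutil.copy2(zone_src, zone_dst)
    return zone_dst
```

Called as `copy_with_zone_file("7703_062.3B.tif", "7703_062_copy.3B.tif")`, it would also create `7703_062_copy.3B.uzn` when the original zone file is present, so both images are processed with the same zones.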

Re: [tesseract-ocr] Extract font size,style,colour from an image

2017-05-19 Thread Zdenko Podobný
Tesseract 3.05 (the current stable version) has the ability to detect some font characteristics, but it is not perfect (e.g. there is no color detection, because OCR is run on binarized images). You can test with hOCR output or experiment with the API (ResultIterator and WordFontAttributes). Zdenko On Fri, May 19, 2017 at 2:43
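As a rough sketch of the hOCR route mentioned above: when font information is enabled in the hOCR output (to my understanding via the `hocr_font_info` config variable, which is an assumption to verify against your build), each `ocrx_word` span carries `x_font` and `x_fsize` entries in its `title` attribute. The parser below uses only the Python standard library; the sample markup and the font name in it are made up for illustration and are not real Tesseract output.

```python
from html.parser import HTMLParser


class WordFontParser(HTMLParser):
    """Collect text, font name and font size for each hOCR ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and attrs.get("class") == "ocrx_word":
            # The title looks like "bbox 36 92 96 116; x_wconf 90; x_font ...".
            props = {}
            for field in (attrs.get("title") or "").split(";"):
                key, _, value = field.strip().partition(" ")
                if key:
                    props[key] = value
            self._current = {
                "text": "",
                "font": props.get("x_font"),
                "size": props.get("x_fsize"),
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "span" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.words.append(self._current)
            self._current = None


def word_fonts(hocr):
    """Return one dict per recognized word found in an hOCR string."""
    parser = WordFontParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


# Invented sample, for illustration only.
SAMPLE = (
    "<span class='ocrx_word' id='word_1_1' "
    "title='bbox 36 92 96 116; x_wconf 90; x_font Arial_Bold; x_fsize 12'>"
    "<strong>Hello</strong></span>"
)

if __name__ == "__main__":
    for word in word_fonts(SAMPLE):
        print(word["text"], word["font"], word["size"])
```

Words without font entries simply get `None` for `font` and `size`, which matches the caveat above that font detection is not perfect.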

[tesseract-ocr] Tesseract 3.02 fails to identify some characters

2017-05-19 Thread Thilina Jayathilaka
I'm working on a C++ project where I need to OCR some text fields. I'm using the Tesseract 3.02 C++ API functions to achieve this, but the OCR results differ from the image. The following image reads as "31 SW19 SQU" when I use the GetUTF8Text() function. [image: "31 SW19 SQU"]

[tesseract-ocr] Re: Extract font size,style,colour from an image

2017-05-19 Thread akhil katpally
tesseract image.tiff image.txt -c tessedit_debug_fonts=1 ... this would give you the font type and the confidence of its font type. On Friday, May 19, 2017 at 5:55:17 AM UTC-7, mandar bandodekar wrote: > > Hi , > Is it possible to extract font colour, font style(Bold , italic), size > using

[tesseract-ocr] Re: Extract font size,style,colour from an image

2017-05-19 Thread akhil katpally
It gives this at the character level. On Friday, May 19, 2017 at 11:25:58 AM UTC-7, akhil katpally wrote: > tesseract image.tiff image.txt -c tessedit_debug_fonts=1 ... this would give you the font type and the confidence of its font type. > On Friday, May 19, 2017 at 5:55:17 AM UTC-7,

[tesseract-ocr] Re: Tesseract 3.02 fails to identify some characters

2017-05-19 Thread akhil katpally
I don't know exactly why it is recognizing the text incorrectly, but here is what I would suggest. Try Tesseract 4.0 with the neural network; I found it much better than the original Tesseract. Also try looking at the bounding boxes for each character; that may give you an idea. On Friday, May 19, 2017 at
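One way to inspect per-character bounding boxes is to have Tesseract write a box file (for example with the `makebox` config, assuming it is available in your install) and read it back. The sketch below parses the usual box-file line layout, `symbol left bottom right top page`, with the origin at the bottom-left of the image. The names `CharBox` and `parse_box_file` and the sample lines are hypothetical, not taken from the thread.

```python
from typing import List, NamedTuple


class CharBox(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_file(text: str) -> List[CharBox]:
    """Parse box-file text into CharBox records, skipping malformed lines."""
    boxes = []
    for line in text.splitlines():
        # Split from the right so the symbol stays intact.
        parts = line.rsplit(" ", 5)
        if len(parts) != 6 or not parts[0]:
            continue
        try:
            numbers = [int(p) for p in parts[1:]]
        except ValueError:
            continue
        boxes.append(CharBox(parts[0], *numbers))
    return boxes


# Invented sample lines, for illustration only.
SAMPLE_BOXES = "3 10 20 30 60 0\n1 34 20 50 60 0\n"

if __name__ == "__main__":
    for box in parse_box_file(SAMPLE_BOXES):
        width = box.right - box.left
        height = box.top - box.bottom
        print(box.symbol, width, height)
```

Unusually wide or tall boxes, or two characters merged into one box, are often a quick hint at where segmentation goes wrong.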

[tesseract-ocr] Re: Detect Vertical words and remove from the images

2017-05-19 Thread akhil katpally
Agreed, it seems the owner of the group needs to give me permission to edit the post. For your information, these images are from the internet and were posted by the courts. I am just reusing them. On Friday, May 19, 2017 at 1:26:20 AM UTC-7, Brian51 wrote: > On Wednesday, May 17, 2017 at

Re: [tesseract-ocr] Re: Extract font size,style,colour from an image

2017-05-19 Thread Zdenko Podobný
Unfortunately, it is only possible to delete a message, which I did a moment ago. Zdenko On Fri, May 19, 2017 at 8:26 PM, akhil katpally wrote: > it gives at the character level > > On Friday, May 19, 2017 at 11:25:58 AM UTC-7, akhil katpally wrote: >> >>

[tesseract-ocr] How I can add box file to tesseract 4 LSTM training?

2017-05-19 Thread Ahmad Moawad
Hello all, I want to train Tesseract 4.0 LSTM for some image templates. I corrected the box file, but I don't know how to include this file in the Tesseract training: training/tesstrain.sh \ --fonts_dir /usr/share/fonts \ --training_text ../langdata/ara/ara.training_text \ --langdata_dir ../langdata

Re: [tesseract-ocr] Training from scratch

2017-05-19 Thread aggiedude
I have already been going through language-specific.sh, but I still have a few questions I hope someone can answer. My initial question, I guess, is: were there other tools used to create the training data for the English model that is currently provided (other than the ones provided on git)?

[tesseract-ocr] Re: Any hints for Arabic user custom traineddata (e.g. new font)

2017-05-19 Thread Ahmad Moawad
Hi, would you mind sharing the corpus? My situation is similar to yours. Do you plan for Fine Tuning OR

[tesseract-ocr] Training from scratch

2017-05-19 Thread aggiedude
If training Tesseract 4 from scratch, for English for example, I know I need to have the proper fonts installed, but what other parameters would be needed to produce the same model for English? I.e., what exposure settings were used to degrade images, etc.?

Re: [tesseract-ocr] Training from scratch

2017-05-19 Thread ShreeDevi Kumar
As per Ray, 4500 fonts and 40 lines of text were used to create the models of Latin-script-based languages. So I am not sure whether you can replicate the model. For language-specific exposure settings etc., see