Re: [tesseract-ocr] Training tesseract-ocr unicharset_extractor, mftraining, cntraining

2019-04-28 Thread Giriraj Bhojak
Hi Shree, Does this mean there is a bug in tesseract 4, and should I open an issue on GitHub for two-column text with the default psm? Also, could you please expand on what you meant by 'other means of selecting text region'? Is there anything in tesseract that I can try to identify text regions?

Re: [tesseract-ocr] Training tesseract-ocr unicharset_extractor, mftraining, cntraining

2019-04-26 Thread Shree Devi Kumar
I did not post the command that I used; it was probably with the default psm and the code as of April 2017. If you really want to investigate, use the commit from the master branch as of that time and test. In theory tesseract 4 should recognize two columns with the default psm. But there seem to be some

Re: [tesseract-ocr] Training tesseract-ocr unicharset_extractor, mftraining, cntraining

2019-04-26 Thread Giriraj Bhojak
Hi Shree, I just tried v3.05.02 as well with different modes and I still couldn't reproduce the output you posted for the image file. I am wondering if I am doing anything wrong. Here is the command I ran with tesseract v3.05.02, changing the psm mode from 1 to 13:

Re: [tesseract-ocr] Training tesseract-ocr unicharset_extractor, mftraining, cntraining

2019-04-26 Thread Shree Devi Kumar
@zdenko Please check this image (from the first post) with 3.0x and current 4.0x code to see if there is a regression in terms of recognition of 2 columns. On Fri, Apr 26, 2019 at 10:25 PM Giriraj Bhojak wrote: > Thank you, I will try it out next. > I wanted to use version 4 of tesseract since

Re: [tesseract-ocr] Training tesseract-ocr unicharset_extractor, mftraining, cntraining

2019-04-26 Thread Giriraj Bhojak
Thank you, I will try it out next. I wanted to use version 4 of tesseract since it uses an LSTM-based OCR engine. Higher accuracy is one of the essential requirements for my use case. Would you know if v4 supports extracting text from an image file with a two-column text structure at all? Thank you for

Re: [tesseract-ocr] Training tesseract-ocr unicharset_extractor, mftraining, cntraining

2019-04-26 Thread Shree Devi Kumar
April 2017 - It is probably the 3.0x version. Try the 3.05 branch. https://github.com/tesseract-ocr/tesseract/releases/tag/3.05.01 3.05.01 Release

Re: [tesseract-ocr] Training tesseract-ocr unicharset_extractor, mftraining, cntraining

2019-04-26 Thread Giriraj Bhojak
Hi Shree, Thank you for the quick response. I used the trained data downloaded from https://github.com/tesseract-ocr/tessdata, https://github.com/tesseract-ocr/tessdata_best and https://github.com/tesseract-ocr/tessdata_fast. I ran the following commands for each of these datasets and

Re: [tesseract-ocr] Training tesseract-ocr unicharset_extractor, mftraining, cntraining

2019-04-26 Thread Shree Devi Kumar
Which eng.traineddata did you use? There are three options: from tessdata, tessdata_best and tessdata_fast. On Fri, 26 Apr 2019, 09:19 Giriraj Bhojak wrote: > Hello Shree, > > I realize this post is more than two years old now, but would appreciate > any help. > I tried your suggestion on the

Re: [tesseract-ocr] Training tesseract-ocr unicharset_extractor, mftraining, cntraining

2017-04-21 Thread ShreeDevi Kumar
If you want to OCR an invoice like the sample you posted, just use the eng.traineddata and OCR the page. You do not need to do any training. Here is the output I get: 8633 0410 NO RP 11 07122015 NYNN 01 01 0001 Page 2 Of 3 Did you know? Your Comcast Business Internet service gives
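
The comparison discussed in this thread (one image, the three eng.traineddata sets, page segmentation modes 1 to 13, and 3.05 versus 4.x) can be sketched as a small script that only builds the command lines. This is a rough illustration, not the commands actually used by anyone in the thread: the image name `invoice.png`, the output base names and the local tessdata directory paths are all hypothetical. It assumes the usual CLI shape `tesseract IMAGE OUTPUTBASE [options]`, with `--tessdata-dir` and `-l eng`, and it assumes the 3.05 binary takes the single-dash `-psm` while 4.x takes `--psm`.

```python
from typing import Iterable, List


def build_command(image: str, out_base: str, tessdata_dir: str,
                  psm: int, legacy_cli: bool = False) -> List[str]:
    """Build one tesseract command line as an argument list.

    legacy_cli=True uses the single-dash -psm spelling (assumed for 3.05);
    otherwise the 4.x spelling --psm is used.
    """
    psm_flag = "-psm" if legacy_cli else "--psm"
    return [
        "tesseract", image, out_base,
        "--tessdata-dir", tessdata_dir,
        "-l", "eng",
        psm_flag, str(psm),
    ]


def build_matrix(image: str, tessdata_dirs: Iterable[str],
                 psm_modes: Iterable[int] = range(1, 14),
                 legacy_cli: bool = False) -> List[List[str]]:
    """Build one command per (tessdata directory, psm mode) pair."""
    modes = list(psm_modes)
    commands = []
    for tessdata_dir in tessdata_dirs:
        # Use the last path component to keep output files apart.
        name = tessdata_dir.rstrip("/").split("/")[-1]
        for psm in modes:
            out_base = f"out_{name}_psm{psm}"
            commands.append(
                build_command(image, out_base, tessdata_dir, psm, legacy_cli)
            )
    return commands


if __name__ == "__main__":
    # Hypothetical local checkouts of the three traineddata repositories.
    dirs = ["data/tessdata", "data/tessdata_best", "data/tessdata_fast"]
    for cmd in build_matrix("invoice.png", dirs):
        print(" ".join(cmd))
```

Printing the commands rather than running them keeps the sketch independent of any installed tesseract version; each printed line can be pasted into a shell, and the resulting text files compared to see which combination preserves the two-column layout.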