Re: [tesseract-ocr] lt-lstmtraining: genericvector.h:720: T& GenericVector::operator[](int) const [with T = char]: Assertion `index >= 0 && index < size_used_' failed.

2018-11-27 Thread Shree Devi Kumar
You make a good point, zdenko. If there are limitations on training data to be used or minimum memory requirements for handling such data for doing custom training, it will be good to document them in the wiki, so that people do not waste time and effort in training if they don't have the minimum

Re: [tesseract-ocr] Missing characters from PDF file because incorrect size of boundingbox from PageIterator::BoundingBox() function

2018-11-27 Thread wialsh w
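I ran into the same situation earlier when using hOCR output. My solution was to recompile Tesseract. I think it may have been because I had previously replaced the original models with the best/bbox models, which caused an incompatibility, so now when I replace models I first back up the data from the original build. I don't know whether your situation is similar. On Tue, Nov 27, 2018 at 3:26 AM, Hwa Chuang wrote: > I was using Tesseract v4 to generate PDF file and found some of string > can't be searched because of missing characters in PDF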

[tesseract-ocr] Re: Tesseract training has an upper limit on the use of cpu?Is the more cpu, the faster the training?

2018-11-27 Thread bruce
Hi Junye Li, I have a workstation with 36 cores (2.0 GHz) and 24 GB of memory, running RHEL. I'm now running text2image to generate the tif/box files; I guess it still needs to run for a week. Next, I will run tesseract to generate the .lstm files; I guess that will take about two weeks.

Re: [tesseract-ocr] lt-lstmtraining: genericvector.h:720: T& GenericVector::operator[](int) const [with T = char]: Assertion `index >= 0 && index < size_used_' failed.

2018-11-27 Thread Zdenko Podobny
Shree, the issue tracker is not for custom training, simply because there are not enough people and it cannot be reproduced... Did you read: "I have been runnig about 130G data which are 4000 files"? Unless you are able to reproduce the problem with very small data, IMO there is nobody who would be

[tesseract-ocr] Re: Tesseract training has an upper limit on the use of cpu?Is the more cpu, the faster the training?

2018-11-27 Thread Junye Li
I don't think that would be the case unless your training text is a few hundred megabytes in size... I am running Tesseract on Ubuntu 18.04, and based on a very quick test it turned out that Tesseract on Ubuntu performed better than on Windows in terms of agreement accuracy (I'm training it for

[tesseract-ocr] Re: Tesseract joins characters that are not touching

2018-11-27 Thread Mohit Jain
Hi, can you tell me how you extracted the binary intermediate image created by Tesseract? On Saturday, June 18, 2016 at 10:16:57 PM UTC+5:30, Julian Einhaus wrote: > > Hi, > I am trying to read three lines of text on a well defined image (pretty > much no background noise, characters

Re: [tesseract-ocr] lt-lstmtraining: genericvector.h:720: T& GenericVector::operator[](int) const [with T = char]: Assertion `index >= 0 && index < size_used_' failed.

2018-11-27 Thread Shree Devi Kumar
In my opinion, the assert still needs to be documented as an issue with LSTM training. On Tue, 27 Nov 2018, 05:03 Zdenko Podobny wrote: > Shree, > > issue tracker is not for custom training. Simply because there is not > enough people and > it can not be reproduced... > Did you read: "I have been

Re: [tesseract-ocr] lt-lstmtraining: genericvector.h:720: T& GenericVector::operator[](int) const [with T = char]: Assertion `index >= 0 && index < size_used_' failed.

2018-11-27 Thread Zdenko Podobny
Yes, you can ;-) If you want to document it, you need to find the reason for the error. If you want to find the reason, you need to dive into 130 GB of input data... Enjoy. IMO the right suggestion is to ask the user to find the file/data that causes the problem and create minimal input data that demonstrates the problem. Creating
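
One way to find the file that causes the problem is to bisect the input list. The Python sketch below is only an illustration under stated assumptions: it assumes a single input file triggers the assertion on its own and does so deterministically. The names `find_failing_file` and `fails` are hypothetical. `fails` stands in for whatever you use to run training on a subset and detect the crash; no actual lstmtraining invocation is shown here.

```python
from typing import Callable, List, Sequence


def find_failing_file(
    files: Sequence[str],
    fails: Callable[[Sequence[str]], bool],
) -> str:
    """Bisect `files` down to one file for which `fails` returns True.

    Assumption: exactly one file triggers the failure by itself, and the
    failure is deterministic. `fails` is a hypothetical callback supplied
    by the caller (e.g. it runs training on the subset and checks whether
    the process aborted).
    """
    candidates: List[str] = list(files)
    if not fails(candidates):
        raise ValueError("the full file set does not reproduce the failure")

    while len(candidates) > 1:
        mid = len(candidates) // 2
        first, second = candidates[:mid], candidates[mid:]
        if fails(first):
            candidates = first
        elif fails(second):
            candidates = second
        else:
            # Neither half fails alone, so the single-file assumption is wrong.
            raise RuntimeError("failure depends on a combination of files")

    return candidates[0]
```

With about 4000 input files this needs on the order of a couple of dozen runs, each on a shrinking subset, instead of one run per file. The single file it returns is a starting point for building the minimal input data that demonstrates the problem.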