Re: [tesseract-ocr] Trained data for E13B font

2019-09-19 Thread ElGato ElMago
Hello,

CMC-7 is a totally different font from E13B.  Only E13B is used around 
here; I've never seen CMC-7 in person.

I had about 100 sample checks and scanned them with a check-reading 
machine, the kind used at banks, so they all have the same image and 
character quality.

Although it's a small sample, in the end there were no phantom characters 
and no misreads on symbols or numerics.  One isolated two-character word 
was skipped.  The number of spaces between words tends to come out shorter 
than it really is, but that causes no problem in parsing.
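The unreliable space counts don't matter if the parser splits on runs of whitespace instead of counting spaces; a minimal sketch (the sample line is illustrative, not from the thread):

```python
import re

def split_micr_words(line: str) -> list[str]:
    # Split the recognized MICR line on *runs* of whitespace, so it does
    # not matter whether the model emitted one space or three.
    return re.split(r"\s+", line.strip())

print(split_micr_words(";123456;  :98765:   <100<"))
# → [';123456;', ':98765:', '<100<']
```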

I'm more or less done at the moment and not going for extensive training.  
I'd think you could improve the training text for CMC-7.  Neural-network 
(LSTM) training works like magic, but it depends somewhat on how the 
training text is prepared.  I analyzed bad boxing in the hOCR output and 
added more of those patterns to the training text.
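That kind of hOCR analysis might look something like this (my own sketch: the regex, the `ocrx_word` markup, and the 5-pixel threshold are assumptions, not the author's actual code):

```python
import re

# Pull bounding boxes out of the hOCR `title` attributes and flag
# suspiciously narrow ones (the "phantom" characters discussed in this
# thread were only ~3 points wide).
BBOX_RE = re.compile(r"title=['\"][^'\"]*bbox (\d+) (\d+) (\d+) (\d+)")

def narrow_boxes(hocr: str, min_width: int = 5):
    """Return (x1, y1, x2, y2) boxes narrower than min_width."""
    boxes = [tuple(map(int, m.groups())) for m in BBOX_RE.finditer(hocr)]
    return [b for b in boxes if b[2] - b[0] < min_width]

sample = ("<span class='ocrx_word' title='bbox 100 20 103 60'>;</span>"
          "<span class='ocrx_word' title='bbox 110 20 140 60'>123</span>")
print(narrow_boxes(sample))  # → [(100, 20, 103, 60)]
```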

Hope this helps.

ElMagoElGato

On Tuesday, September 17, 2019 at 2:44:59 PM UTC+9, Mamadou wrote:

> Hello,
>
>
> Thanks again for sharing your E-13B traineddata, it was helpful. 
> We’ve managed to get good accuracy for E-13B with Tesseract but failed with 
> CMC-7, so we ended up using TensorFlow for both fonts.
>
> I’m curious to know what level of accuracy you’ve reached. You can check our 
> accuracy for Tesseract using the app at 
> https://github.com/DoubangoTelecom/tesseractMICR#the-recognizer-app, and for 
> TensorFlow at https://www.doubango.org/webapps/micr/. 
>
> Also, have you tried with real-life samples (e.g. random images from a Google 
> search)? Why are you including the SPACE in your charset and training data? 
> It makes convergence harder.
>
> As promised, the dataset is hosted at 
> https://github.com/DoubangoTelecom/tesseractMICR
>
>
> On Friday, August 9, 2019 at 10:40:15 AM UTC+2, ElGato ElMago wrote:
>>
>> I added eng.traineddata and LICENSE.  I used my account name in the 
>> license file.  I don't know if it's appropriate or not.  Please tell me if 
>> it's not.
>>
>> On Friday, August 9, 2019 at 4:17:41 PM UTC+9, Mamadou wrote:
>>>
>>>
>>>
>>> On Friday, August 9, 2019 at 7:31:03 AM UTC+2, ElGato ElMago wrote:
>>>>
>>>> Here's my sharing on GitHub.  Hope it's of any use for somebody.
>>>>
>>>> https://github.com/ElMagoElGato/tess_e13b_training
>>>>
>>> Thanks for sharing your experience with us.
>>> Is it possible to share your Tesseract model (xxx.traineddata)?
>>> We're building a dataset using real life images like what we have 
>>> already done for MRZ (
>>> https://github.com/DoubangoTelecom/tesseractMRZ/tree/master/dataset).
>>> Your model would help us to automated the annotation and will speedup 
>>> our devs. Off course we'll have to manualy correct the annotations but it 
>>> will be faster for us. 
>>> Also, please add a license to your repo so that we know if we have right 
>>> to use it
>>>
>>>>
>>>>
>>>> On Thursday, August 8, 2019 at 9:35:17 AM UTC+9, ElGato ElMago wrote:
>>>>>
>>>>> OK, I'll do so.  I need to reorganize naming and so on a little bit.  
>>>>> Will be out there soon.
>>>>>
>>>>> On Wednesday, August 7, 2019 at 9:11:01 PM UTC+9, Mamadou wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wednesday, August 7, 2019 at 2:36:52 AM UTC+2, ElGato ElMago wrote:
>>>>>>>
>>>>>>> HI,
>>>>>>>
>>>>>>> I'm thinking of sharing it of course.  What is the best way to do 
>>>>>>> it?  After all this, the contribution part of mine is only how I 
>>>>>>> prepared 
>>>>>>> the training text.  Even that is consist of Shree's text and mine.  The 
>>>>>>> instructions and tools I used already exist.
>>>>>>>
>>>>>> If you have a Github account just create a repo and publish the data 
>>>>>> and instructions. 
>>>>>>
>>>>>>>
>>>>>>> ElMagoElGato
>>>>>>>
>>>>>>> On Wednesday, August 7, 2019 at 8:20:02 AM UTC+9, Mamadou wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>> Are you planning to release the dataset or models?
>>>>>>>> I'm working on the same subject and planning to share both under 
>>>>>>>> BSD terms
>>>>>>>>
>>>>>>>> On Tuesday, August 6, 2019 at 10:11:40 AM UTC+2, ElGato ElMago 
>>>>>>>> wrote:
>>>>>>>>>
>>>

[tesseract-ocr] Re: Tesseract.js and traineddata language.

2019-09-04 Thread ElGato ElMago
Why don't you ask your questions over there?  I believe you've already been 
advised to do so.

On Thursday, September 5, 2019 at 3:24:02 AM UTC+9, Clint William Theron wrote:
>
> Intuitively I know it answers my question, but I fail to see the answer. 
> Here's what went through my mind as I read your link: "I think a gitpod is 
> a node.js server, so that means the file should be in the fs where the 
> command was executed. The command got executed in the demo.html file, which 
> is located in the browser directory, but there is no .traineddata in that 
> folder. Maybe the command got executed in the /dist directory because at 
> the beginning of the script we included the following
>
> 
>
> but if so I don't see the directory in the project..."
>
> It's about here that I get lost. I now think maybe I should declare the 
> langPath, but I did that and I already told you what happens...
>
> Guys, help me out here, because after I get this right I still need to 
> work on my .traineddata file itself. It's working but it's limited; I just 
> made it to get started...
>
> Thanks already :-)
>
> On Wednesday, September 4, 2019 at 6:54:39 PM UTC+2, Clint William Theron 
> wrote:
>>
>> Actually, that doesn't answer my question. It only says where tesseract 
>> stores the .traineddata file after download and not how to set the 
>> langPath. I tried to set the langPath like this:
>>
>> const worker = new TesseractWorker({
>>   corePath: '../../node_modules/tesseract.js-core/tesseract-core.wasm.js',
>>   langPath: lang_path
>> });
>>
>> worker.recognize(file, 'cus')
>>   .progress(function (packet) {
>>     console.info(packet);
>>     progressUpdate(packet);
>>   })
>>   .then(function (data) {
>>     console.log(data);
>>     progressUpdate({ status: 'done', data: data });
>>   });
>>
>>
>>
>> but I get errors: 
>> * Error opening data file ./cus.traineddata
>> * Please make sure the TESSDATA_PREFIX environment variable is set to 
>> your "tessdata" directory.
>>
>> Thanks for anything...
>>
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/8f1e2366-ebb6-4308-a8a8-f2069fc98a37%40googlegroups.com.


Re: [tesseract-ocr] Re: Best Trained data for Non MRZ data

2019-08-26 Thread ElGato ElMago
It's just an idea, but tessdata_best seems to fit better than tessdata when 
you specify OEM_LSTM_ONLY.  Did you try that?

On Monday, August 26, 2019 at 2:21:14 PM UTC+9, Tintu Jacob wrote:
>
> Please find below the result obtained from tesseract.
>
> Check nationality, DOB and expiry date:
>
>  <>  
>
> STATE oF KUWAIT evi no = 5%
>
> CDN 285031504457 wat hon
>
> Pallas La; daa ~~
> Name MOHAMMAD RAHAT
> ABDUL KHALIQ
> ON +L yaonaity IND SA a
> < b Sex " Aa so i
>
> Burth Date 1503/1985 SA fol
> EpayDate  {8/0412020 LST pL
>
>
> Please find the code used to do OCR with tesseract:
>
> Tesseract instance = new Tesseract();
> // set the tessdata path
> instance.setDatapath(tesseractPath);
> instance.setOcrEngineMode(TessOcrEngineMode.OEM_LSTM_ONLY);
> instance.setLanguage("eng");
> instance.setPageSegMode(TessPageSegMode.PSM_AUTO);
> instance.setTessVariable("load_freq_dawg", "true");
> instance.setTessVariable("load_system_dawg", "true");
> instance.setTessVariable("tessedit_char_whitelist",
> "AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz0123456789/<");
> data = instance.doOCR(image);
>
>



Re: [tesseract-ocr] Re: Best Trained data for Non MRZ data

2019-08-22 Thread ElGato ElMago
What did you do and what was your result?

On Friday, August 23, 2019 at 4:42:44 AM UTC+9, Tintu Jacob wrote:
>
> Please find sample file
>
> https://drive.google.com/file/d/0B93Vnm9ZxkpyUnd4V0VxdUg1QmV6NGVNMWVwcFpuWGxLVjE0/view?usp=drivesdk
>
> And we are trying only the Kuwait civil ID card; its OCR accuracy on the 
> non-MRZ page (front page) is low.
>



[tesseract-ocr] Re: Best Trained data for Non MRZ data

2019-08-20 Thread ElGato ElMago
It isn't OCR-B then.  Pick your local language from the tessdata_best 
traineddata; you can try that first.

On Wednesday, August 21, 2019 at 12:49:17 AM UTC+9, Tintu Jacob wrote:
>
> Hi 
>
> We are trying to read a national ID card using tesseract and are able to 
> read the MRZ side of the card image, but we are unable to read the non-MRZ 
> side using this trained data. Could you please share the best trained data 
> for reading non-MRZ data from ID cards? 
>
>
>  Regards 
> Tintu



Re: [tesseract-ocr] Trained data for E13B font

2019-08-12 Thread ElGato ElMago
So I did both: renamed the file and added a link on the wiki page.

On Saturday, August 10, 2019 at 12:35:14 AM UTC+9, shree wrote:
>
> I suggest renaming the traineddata file from eng. to e13b or another 
> similarly descriptive name, and also adding a link to it on the data file 
> contributions wiki page.
>
> On Fri, 9 Aug 2019, 20:08 'Mamadou' via tesseract-ocr, <
> tesser...@googlegroups.com > wrote:
>
>>
>>
>> On Friday, August 9, 2019 at 10:40:15 AM UTC+2, ElGato ElMago wrote:
>>>
>>> I added eng.traineddata and LICENSE.  I used my account name in the 
>>> license file.  I don't know if it's appropriate or not.  Please tell me if 
>>> it's not.
>>>
>> It's ok.
>> Thanks. I'll share our dataset (real life samples) in the coming days. 
>>
>>>
>>> On Friday, August 9, 2019 at 4:17:41 PM UTC+9, Mamadou wrote:
>>>>
>>>>
>>>>
>>>> On Friday, August 9, 2019 at 7:31:03 AM UTC+2, ElGato ElMago wrote:
>>>>>
>>>>> Here's my sharing on GitHub.  Hope it's of any use for somebody.
>>>>>
>>>>> https://github.com/ElMagoElGato/tess_e13b_training
>>>>>
>>>> Thanks for sharing your experience with us.
>>>> Is it possible to share your Tesseract model (xxx.traineddata)?
>>>> We're building a dataset using real life images like what we have 
>>>> already done for MRZ (
>>>> https://github.com/DoubangoTelecom/tesseractMRZ/tree/master/dataset).
>>>> Your model would help us to automated the annotation and will speedup 
>>>> our devs. Off course we'll have to manualy correct the annotations but it 
>>>> will be faster for us. 
>>>> Also, please add a license to your repo so that we know if we have 
>>>> right to use it
>>>>
>>>>>
>>>>>
>>>>> On Thursday, August 8, 2019 at 9:35:17 AM UTC+9, ElGato ElMago wrote:
>>>>>>
>>>>>> OK, I'll do so.  I need to reorganize naming and so on a little bit.  
>>>>>> Will be out there soon.
>>>>>>
>>>>>> On Wednesday, August 7, 2019 at 9:11:01 PM UTC+9, Mamadou wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wednesday, August 7, 2019 at 2:36:52 AM UTC+2, ElGato ElMago 
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> HI,
>>>>>>>>
>>>>>>>> I'm thinking of sharing it of course.  What is the best way to do 
>>>>>>>> it?  After all this, the contribution part of mine is only how I 
>>>>>>>> prepared 
>>>>>>>> the training text.  Even that is consist of Shree's text and mine.  
>>>>>>>> The 
>>>>>>>> instructions and tools I used already exist.
>>>>>>>>
>>>>>>> If you have a Github account just create a repo and publish the data 
>>>>>>> and instructions. 
>>>>>>>
>>>>>>>>
>>>>>>>> ElMagoElGato
>>>>>>>>
>>>>>>>> On Wednesday, August 7, 2019 at 8:20:02 AM UTC+9, Mamadou wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>> Are you planning to release the dataset or models?
>>>>>>>>> I'm working on the same subject and planning to share both under 
>>>>>>>>> BSD terms
>>>>>>>>>
>>>>>>>>> On Tuesday, August 6, 2019 at 10:11:40 AM UTC+2, ElGato ElMago 
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> FWIW, I got to the point where I can feel happy with the 
>>>>>>>>>> accuracy. As the images of the previous post show, the symbols, 
>>>>>>>>>> especially 
>>>>>>>>>> on-us symbol and amount symbol, were causing mix-up each other or to 
>>>>>>>>>> another character.  I added much more more symbols to the training 
>>>>>>>>>> text and 
>>>>>>>>>> formed words that start with a symbol.  One example is as follows:
>>>>>>>>>>
>>>>>>>>>> 9;:;=;<;< <0<1<3<4;6;8;9;:;=;
>>>>>>>>>>
>>>>>>>>>>

Re: [tesseract-ocr] Trained data for E13B font

2019-08-09 Thread ElGato ElMago
I added eng.traineddata and LICENSE.  I used my account name in the license 
file.  I don't know if it's appropriate or not.  Please tell me if it's not.

On Friday, August 9, 2019 at 4:17:41 PM UTC+9, Mamadou wrote:
>
>
>
> On Friday, August 9, 2019 at 7:31:03 AM UTC+2, ElGato ElMago wrote:
>>
>> Here's my sharing on GitHub.  Hope it's of any use for somebody.
>>
>> https://github.com/ElMagoElGato/tess_e13b_training
>>
> Thanks for sharing your experience with us.
> Is it possible to share your Tesseract model (xxx.traineddata)?
> We're building a dataset using real life images like what we have already 
> done for MRZ (
> https://github.com/DoubangoTelecom/tesseractMRZ/tree/master/dataset).
> Your model would help us automate the annotation and will speed up our 
> development. Of course we'll have to manually correct the annotations, but 
> it will be faster for us. 
> Also, please add a license to your repo so that we know whether we have the 
> right to use it.
>
>>
>>
>> On Thursday, August 8, 2019 at 9:35:17 AM UTC+9, ElGato ElMago wrote:
>>>
>>> OK, I'll do so.  I need to reorganize naming and so on a little bit.  
>>> Will be out there soon.
>>>
>>> On Wednesday, August 7, 2019 at 9:11:01 PM UTC+9, Mamadou wrote:
>>>>
>>>>
>>>>
>>>> On Wednesday, August 7, 2019 at 2:36:52 AM UTC+2, ElGato ElMago wrote:
>>>>>
>>>>> HI,
>>>>>
>>>>> I'm thinking of sharing it of course.  What is the best way to do it?  
>>>>> After all this, the contribution part of mine is only how I prepared the 
>>>>> training text.  Even that is consist of Shree's text and mine.  The 
>>>>> instructions and tools I used already exist.
>>>>>
>>>> If you have a Github account just create a repo and publish the data 
>>>> and instructions. 
>>>>
>>>>>
>>>>> ElMagoElGato
>>>>>
>>>>> On Wednesday, August 7, 2019 at 8:20:02 AM UTC+9, Mamadou wrote:
>>>>>
>>>>>> Hello,
>>>>>> Are you planning to release the dataset or models?
>>>>>> I'm working on the same subject and planning to share both under BSD 
>>>>>> terms
>>>>>>
>>>>>> On Tuesday, August 6, 2019 at 10:11:40 AM UTC+2, ElGato ElMago wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> FWIW, I got to the point where I can feel happy with the accuracy. 
>>>>>>> As the images of the previous post show, the symbols, especially on-us 
>>>>>>> symbol and amount symbol, were causing mix-up each other or to another 
>>>>>>> character.  I added much more more symbols to the training text and 
>>>>>>> formed 
>>>>>>> words that start with a symbol.  One example is as follows:
>>>>>>>
>>>>>>> 9;:;=;<;< <0<1<3<4;6;8;9;:;=;
>>>>>>>
>>>>>>>
>>>>>>> I randomly made 8,000 lines like this.  In fine-tuning from eng, 
>>>>>>> 5,000 iteration was almost good.  Amount symbol still is confused a 
>>>>>>> little 
>>>>>>> when it's followed by 0.  Fine tuning tends to be dragged by small 
>>>>>>> particles.  I'll have to think of something to make further improvement.
>>>>>>>
>>>>>>> Training from scratch produced a bit more stable traineddata.  It 
>>>>>>> doesn't get confused with symbols so often but tends to generate extra 
>>>>>>> spaces.  By 10,000 iterations, those spaces are gone and recognition 
>>>>>>> became 
>>>>>>> very solid.
>>>>>>>
>>>>>>> I thought I might have to do image and box file training but I guess 
>>>>>>> it's not needed this time.
>>>>>>>
>>>>>>> ElMagoElGato
>>>>>>>
>>>>>>> On Friday, July 26, 2019 at 2:08:06 PM UTC+9, ElGato ElMago wrote:
>>>>>>>>
>>>>>>>> HI,
>>>>>>>>
>>>>>>>> Well, I read the description of ScrollView (
>>>>>>>> https://github.com/tesseract-ocr/tesseract/wiki/ViewerDebugging) 
>>>>>>>> and it says:
>>>>>>>>
>>>>>>>> To show the characters, deselect DISPLAY/Bounding Boxes, select 
>>>>>>>> DISPLAY/Polygonal Approx and then select OTHER/Uniform display.

[tesseract-ocr] Re: tesseract output is of first page only

2019-08-09 Thread ElGato ElMago
Is it possible to have multiple pages in a png file in the first place?
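For what it's worth, tesseract's plain-text output ends each page with a form-feed character, which is the ^L the poster saw. If the input really held multiple pages (e.g. a multi-page TIFF, since a PNG normally holds a single image), the combined text could be split back into pages; a small sketch:

```python
def split_pages(txt: str) -> list[str]:
    # Tesseract's txt output ends each page with a form feed (\x0c, shown
    # as ^L in editors); splitting on it recovers the per-page text.
    return [p for p in txt.split("\f") if p.strip()]

sample = "page one text\n\fpage two text\n\f"
print(split_pages(sample))  # → ['page one text\n', 'page two text\n']
```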

On Friday, August 9, 2019 at 2:41:15 PM UTC+9, ilevy wrote:
>
> I'm trying tesseract for the first time with a png of a multipage document 
> I saved out of a pdf (which itself was just an image).
>
> When I run tesseract, I get an output of the first page, but that's all. I 
> notice that there's a control-L (^L) at the end of the text file.
>
> How do I get the entire file output to txt?
>



Re: [tesseract-ocr] Trained data for E13B font

2019-08-08 Thread ElGato ElMago
Here's what I'm sharing on GitHub.  Hope it's of use to somebody.

https://github.com/ElMagoElGato/tess_e13b_training

On Thursday, August 8, 2019 at 9:35:17 AM UTC+9, ElGato ElMago wrote:
>
> OK, I'll do so.  I need to reorganize naming and so on a little bit.  Will 
> be out there soon.
>
> On Wednesday, August 7, 2019 at 9:11:01 PM UTC+9, Mamadou wrote:
>>
>>
>>
>> On Wednesday, August 7, 2019 at 2:36:52 AM UTC+2, ElGato ElMago wrote:
>>>
>>> HI,
>>>
>>> I'm thinking of sharing it of course.  What is the best way to do it?  
>>> After all this, the contribution part of mine is only how I prepared the 
>>> training text.  Even that is consist of Shree's text and mine.  The 
>>> instructions and tools I used already exist.
>>>
>> If you have a Github account just create a repo and publish the data and 
>> instructions. 
>>
>>>
>>> ElMagoElGato
>>>
>>> On Wednesday, August 7, 2019 at 8:20:02 AM UTC+9, Mamadou wrote:
>>>
>>>> Hello,
>>>> Are you planning to release the dataset or models?
>>>> I'm working on the same subject and planning to share both under BSD 
>>>> terms
>>>>
>>>> On Tuesday, August 6, 2019 at 10:11:40 AM UTC+2, ElGato ElMago wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> FWIW, I got to the point where I can feel happy with the accuracy. As 
>>>>> the images of the previous post show, the symbols, especially on-us 
>>>>> symbol 
>>>>> and amount symbol, were causing mix-up each other or to another 
>>>>> character.  
>>>>> I added much more more symbols to the training text and formed words that 
>>>>> start with a symbol.  One example is as follows:
>>>>>
>>>>> 9;:;=;<;< <0<1<3<4;6;8;9;:;=;
>>>>>
>>>>>
>>>>> I randomly made 8,000 lines like this.  In fine-tuning from eng, 5,000 
>>>>> iteration was almost good.  Amount symbol still is confused a little when 
>>>>> it's followed by 0.  Fine tuning tends to be dragged by small particles.  
>>>>> I'll have to think of something to make further improvement.
>>>>>
>>>>> Training from scratch produced a bit more stable traineddata.  It 
>>>>> doesn't get confused with symbols so often but tends to generate extra 
>>>>> spaces.  By 10,000 iterations, those spaces are gone and recognition 
>>>>> became 
>>>>> very solid.
>>>>>
>>>>> I thought I might have to do image and box file training but I guess 
>>>>> it's not needed this time.
>>>>>
>>>>> ElMagoElGato
>>>>>
>>>>> On Friday, July 26, 2019 at 2:08:06 PM UTC+9, ElGato ElMago wrote:
>>>>>>
>>>>>> HI,
>>>>>>
>>>>>> Well, I read the description of ScrollView (
>>>>>> https://github.com/tesseract-ocr/tesseract/wiki/ViewerDebugging) and 
>>>>>> it says:
>>>>>>
>>>>>> To show the characters, deselect DISPLAY/Bounding Boxes, select 
>>>>>> DISPLAY/Polygonal Approx and then select OTHER/Uniform display.
>>>>>>
>>>>>>
>>>>>> It basically works.  But for some reason, it doesn't work on my e13b 
>>>>>> image and ends up with a blue screen.  Anyway, it shows each box 
>>>>>> separately 
>>>>>> when a character is consist of multiple boxes.  I'd like to show the box 
>>>>>> for the whole character.  ScrollView doesn't do it, at least, yet.  I'll 
>>>>>> do 
>>>>>> it on my own.
>>>>>>
>>>>>> ElMagoElGato
>>>>>>
>>>>>> On Wednesday, July 24, 2019 at 2:10:46 PM UTC+9, ElGato ElMago wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>>
>>>>>>> I got this result from hocr.  This is where one of the phantom 
>>>>>>> characters comes from.
>>>>>>>
>>>>>>> 
>>>>>>> ;
>>>>>>>
>>>>>>>
>>>>>>> The firs character is the phantom.  It starts with the second 
>>>>>>> character that exists on x axis.  The first character only has 3 points 
>>>>>>> width.  I attach ScrollView screen shots that visualize this.
>>>>>>>

Re: [tesseract-ocr] Trained data for E13B font

2019-08-07 Thread ElGato ElMago
OK, I'll do so.  I need to reorganize naming and so on a little bit.  Will 
be out there soon.

On Wednesday, August 7, 2019 at 9:11:01 PM UTC+9, Mamadou wrote:
>
>
>
> On Wednesday, August 7, 2019 at 2:36:52 AM UTC+2, ElGato ElMago wrote:
>>
>> HI,
>>
>> I'm thinking of sharing it of course.  What is the best way to do it?  
>> After all this, the contribution part of mine is only how I prepared the 
>> training text.  Even that is consist of Shree's text and mine.  The 
>> instructions and tools I used already exist.
>>
> If you have a GitHub account, just create a repo and publish the data and 
> instructions. 
>
>>
>> ElMagoElGato
>>
>> On Wednesday, August 7, 2019 at 8:20:02 AM UTC+9, Mamadou wrote:
>>
>>> Hello,
>>> Are you planning to release the dataset or models?
>>> I'm working on the same subject and planning to share both under BSD 
>>> terms
>>>
>>> On Tuesday, August 6, 2019 at 10:11:40 AM UTC+2, ElGato ElMago wrote:
>>>>
>>>> Hi,
>>>>
>>>> FWIW, I got to the point where I can feel happy with the accuracy. As 
>>>> the images of the previous post show, the symbols, especially on-us symbol 
>>>> and amount symbol, were causing mix-up each other or to another character. 
>>>>  
>>>> I added much more more symbols to the training text and formed words that 
>>>> start with a symbol.  One example is as follows:
>>>>
>>>> 9;:;=;<;< <0<1<3<4;6;8;9;:;=;
>>>>
>>>>
>>>> I randomly made 8,000 lines like this.  In fine-tuning from eng, 5,000 
>>>> iteration was almost good.  Amount symbol still is confused a little when 
>>>> it's followed by 0.  Fine tuning tends to be dragged by small particles.  
>>>> I'll have to think of something to make further improvement.
>>>>
>>>> Training from scratch produced a bit more stable traineddata.  It 
>>>> doesn't get confused with symbols so often but tends to generate extra 
>>>> spaces.  By 10,000 iterations, those spaces are gone and recognition 
>>>> became 
>>>> very solid.
>>>>
>>>> I thought I might have to do image and box file training but I guess 
>>>> it's not needed this time.
>>>>
>>>> ElMagoElGato
>>>>
>>>> On Friday, July 26, 2019 at 2:08:06 PM UTC+9, ElGato ElMago wrote:
>>>>>
>>>>> HI,
>>>>>
>>>>> Well, I read the description of ScrollView (
>>>>> https://github.com/tesseract-ocr/tesseract/wiki/ViewerDebugging) and 
>>>>> it says:
>>>>>
>>>>> To show the characters, deselect DISPLAY/Bounding Boxes, select 
>>>>> DISPLAY/Polygonal Approx and then select OTHER/Uniform display.
>>>>>
>>>>>
>>>>> It basically works.  But for some reason, it doesn't work on my e13b 
>>>>> image and ends up with a blue screen.  Anyway, it shows each box 
>>>>> separately 
>>>>> when a character is consist of multiple boxes.  I'd like to show the box 
>>>>> for the whole character.  ScrollView doesn't do it, at least, yet.  I'll 
>>>>> do 
>>>>> it on my own.
>>>>>
>>>>> ElMagoElGato
>>>>>
>>>>> On Wednesday, July 24, 2019 at 2:10:46 PM UTC+9, ElGato ElMago wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>
>>>>>> I got this result from hocr.  This is where one of the phantom 
>>>>>> characters comes from.
>>>>>>
>>>>>> 
>>>>>> ;
>>>>>>
>>>>>>
>>>>>> The firs character is the phantom.  It starts with the second 
>>>>>> character that exists on x axis.  The first character only has 3 points 
>>>>>> width.  I attach ScrollView screen shots that visualize this.
>>>>>>
>>>>>> [image: 2019-07-24-132643_854x707_scrot.png][image: 
>>>>>> 2019-07-24-132800_854x707_scrot.png]
>>>>>>
>>>>>>
>>>>>> There seem to be some more cases to cause phantom characters.  I'll 
>>>>>> look them in.  But I have a trivial question now.  I made ScrollView 
>>>>>> show 
>>>>>> these displays by accidentally clicking Display->Blamer menu.  There is 
>>>> Bounding Boxes menu below but it ends up showing a blue screen though it 
>>>> briefly shows boxes on the way.

Re: [tesseract-ocr] Trained data for E13B font

2019-08-06 Thread ElGato ElMago
Hi,

I'm thinking of sharing it, of course.  What is the best way to do it?  
After all this, my contribution is really only how I prepared the training 
text, and even that consists of Shree's text and mine.  The instructions 
and tools I used already exist.

ElMagoElGato

On Wednesday, August 7, 2019 at 8:20:02 AM UTC+9, Mamadou wrote:

> Hello,
> Are you planning to release the dataset or models?
> I'm working on the same subject and planning to share both under BSD terms
>
> On Tuesday, August 6, 2019 at 10:11:40 AM UTC+2, ElGato ElMago wrote:
>>
>> Hi,
>>
>> FWIW, I got to the point where I can feel happy with the accuracy. As the 
>> images of the previous post show, the symbols, especially on-us symbol and 
>> amount symbol, were causing mix-up each other or to another character.  I 
>> added much more more symbols to the training text and formed words that 
>> start with a symbol.  One example is as follows:
>>
>> 9;:;=;<;< <0<1<3<4;6;8;9;:;=;
>>
>>
>> I randomly made 8,000 lines like this.  In fine-tuning from eng, 5,000 
>> iteration was almost good.  Amount symbol still is confused a little when 
>> it's followed by 0.  Fine tuning tends to be dragged by small particles.  
>> I'll have to think of something to make further improvement.
>>
>> Training from scratch produced a bit more stable traineddata.  It doesn't 
>> get confused with symbols so often but tends to generate extra spaces.  By 
>> 10,000 iterations, those spaces are gone and recognition became very solid.
>>
>> I thought I might have to do image and box file training but I guess it's 
>> not needed this time.
>>
>> ElMagoElGato
>>
>> On Friday, July 26, 2019 at 2:08:06 PM UTC+9, ElGato ElMago wrote:
>>>
>>> HI,
>>>
>>> Well, I read the description of ScrollView (
>>> https://github.com/tesseract-ocr/tesseract/wiki/ViewerDebugging) and it 
>>> says:
>>>
>>> To show the characters, deselect DISPLAY/Bounding Boxes, select 
>>> DISPLAY/Polygonal Approx and then select OTHER/Uniform display.
>>>
>>>
>>> It basically works.  But for some reason, it doesn't work on my e13b 
>>> image and ends up with a blue screen.  Anyway, it shows each box separately 
>>> when a character is consist of multiple boxes.  I'd like to show the box 
>>> for the whole character.  ScrollView doesn't do it, at least, yet.  I'll do 
>>> it on my own.
>>>
>>> ElMagoElGato
>>>
>>> On Wednesday, July 24, 2019 at 2:10:46 PM UTC+9, ElGato ElMago wrote:
>>>>
>>>> Hi,
>>>>
>>>>
>>>> I got this result from hocr.  This is where one of the phantom 
>>>> characters comes from.
>>>>
>>>> 
>>>> ;
>>>>
>>>>
>>>> The firs character is the phantom.  It starts with the second character 
>>>> that exists on x axis.  The first character only has 3 points width.  I 
>>>> attach ScrollView screen shots that visualize this.
>>>>
>>>> [image: 2019-07-24-132643_854x707_scrot.png][image: 
>>>> 2019-07-24-132800_854x707_scrot.png]
>>>>
>>>>
>>>> There seem to be some more cases to cause phantom characters.  I'll 
>>>> look them in.  But I have a trivial question now.  I made ScrollView show 
>>>> these displays by accidentally clicking Display->Blamer menu.  There is 
>>>> Bounding Boxes menu below but it ends up showing a blue screen though it 
>>>> briefly shows boxes on the way.  Can I use this menu at all?  It'll be 
>>>> very 
>>>> useful.
>>>>
>>>> [image: 2019-07-24-140739_854x707_scrot.png]
>>>>
>>>>
>>>> On Tuesday, July 23, 2019 at 5:10:36 PM UTC+9, ElGato ElMago wrote:
>>>>>
>>>>> It's great! Perfect!  Thanks a lot!
>>>>>
>>>>> On Tuesday, July 23, 2019 at 10:56:58 AM UTC+9, shree wrote:
>>>>>>
>>>>>> See https://github.com/tesseract-ocr/tesseract/issues/2580
>>>>>>
>>>>>> On Tue, 23 Jul 2019, 06:23 ElGato ElMago,  
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I read the output of hocr with lstm_choice_mode = 4 as to the pull 
>>>>>>> request 2554.  It shows the candidates for each character but doesn't 
>>>>>>> show 
>>>>>>> bounding box of each character.  It only shows the box for a whole word.
>>>>>>>

Re: [tesseract-ocr] Trained data for E13B font

2019-08-06 Thread ElGato ElMago
Hi,

FWIW, I got to the point where I can feel happy with the accuracy. As the 
images in the previous post show, the symbols, especially the on-us symbol 
and the amount symbol, were getting mixed up with each other or with other 
characters.  I added many more symbols to the training text and formed 
words that start with a symbol.  One example is as follows:

9;:;=;<;< <0<1<3<4;6;8;9;:;=;


I randomly made 8,000 lines like this.  When fine-tuning from eng, 5,000 
iterations were almost enough.  The amount symbol is still confused a 
little when it's followed by a 0; fine-tuning tends to get dragged around 
by small artifacts.  I'll have to think of something to improve it further.
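A generator for lines like the example above might be sketched as follows (my own reconstruction: the `;:<=` symbol characters and the 8,000-line count come from the message, while the word lengths and symbol frequency are assumptions):

```python
import random

# Lines of digit "words" salted with the four E13B symbol characters
# (written here as ; : < = as in the example line), each word starting
# with a symbol, as described in the message above.
SYMBOLS = ";:<="
DIGITS = "0123456789"

def make_word(rng: random.Random) -> str:
    chars = [rng.choice(SYMBOLS)]  # start the word with a symbol
    for _ in range(rng.randint(3, 12)):
        # mix digits with occasional symbols inside the word
        chars.append(rng.choice(SYMBOLS if rng.random() < 0.3 else DIGITS))
    return "".join(chars)

def make_training_text(n_lines: int = 8000, seed: int = 0) -> list[str]:
    rng = random.Random(seed)
    return [" ".join(make_word(rng) for _ in range(rng.randint(2, 4)))
            for _ in range(n_lines)]

lines = make_training_text(8000)
print(len(lines), lines[0])
```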

Training from scratch produced slightly more stable traineddata.  It 
doesn't confuse the symbols as often but tends to generate extra spaces.  
By 10,000 iterations those spaces were gone and recognition became very 
solid.

I thought I might have to do image and box file training but I guess it's 
not needed this time.

ElMagoElGato

On Friday, July 26, 2019 at 2:08:06 PM UTC+9, ElGato ElMago wrote:
>
> HI,
>
> Well, I read the description of ScrollView (
> https://github.com/tesseract-ocr/tesseract/wiki/ViewerDebugging) and it 
> says:
>
> To show the characters, deselect DISPLAY/Bounding Boxes, select 
> DISPLAY/Polygonal Approx and then select OTHER/Uniform display.
>
>
> It basically works.  But for some reason, it doesn't work on my e13b image 
> and ends up with a blue screen.  Anyway, it shows each box separately when 
> a character consists of multiple boxes.  I'd like to show the box for the 
> whole character.  ScrollView doesn't do that, at least not yet, so I'll do 
> it on my own.
>
> ElMagoElGato
>
> On Wednesday, July 24, 2019 at 2:10:46 PM UTC+9, ElGato ElMago wrote:
>>
>> Hi,
>>
>>
>> I got this result from hocr.  This is where one of the phantom characters 
>> comes from.
>>
>> 
>> ;
>>
>>
>> The first character is the phantom.  It starts at the same place on the x 
>> axis as the second character, which actually exists.  The first character 
>> is only 3 points wide.  I attach ScrollView screenshots that visualize this.
>>
>> [image: 2019-07-24-132643_854x707_scrot.png][image: 
>> 2019-07-24-132800_854x707_scrot.png]
>>
>>
>> There seem to be some more cases to cause phantom characters.  I'll look 
>> them in.  But I have a trivial question now.  I made ScrollView show these 
>> displays by accidentally clicking Display->Blamer menu.  There is Bounding 
>> Boxes menu below but it ends up showing a blue screen though it briefly 
>> shows boxes on the way.  Can I use this menu at all?  It'll be very useful.
>>
>> [image: 2019-07-24-140739_854x707_scrot.png]
>>
>>
>> On Tuesday, July 23, 2019 at 5:10:36 PM UTC+9, ElGato ElMago wrote:
>>>
>>> It's great! Perfect!  Thanks a lot!
>>>
>>> On Tuesday, July 23, 2019 at 10:56:58 AM UTC+9, shree wrote:
>>>>
>>>> See https://github.com/tesseract-ocr/tesseract/issues/2580
>>>>
>>>> On Tue, 23 Jul 2019, 06:23 ElGato ElMago,  wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I read the output of hocr with lstm_choice_mode = 4 as to the pull 
>>>>> request 2554.  It shows the candidates for each character but doesn't 
>>>>> show 
>>>>> bounding box of each character.  It only shows the box for a whole word.
>>>>>
>>>>> I see bounding boxes of each character in comments of the pull request 
>>>>> 2576.  How can I do that?  Do I have to look in the source code and 
>>>>> manipulate such an output on my own?
>>>>>
>>>>> On Friday, July 19, 2019 at 18:40:49 UTC+9, ElGato ElMago wrote:
>>>>>
>>>>>> Lorenzo,
>>>>>>
>>>>>> I haven't been checking psm too much.  Will turn to those options 
>>>>>> after I see how it goes with bounding boxes.
>>>>>>
>>>>>> Shree,
>>>>>>
>>>>>> I see the merges in the git log and also see that new 
>>>>>> option lstm_choice_amount works now.  I guess my executable is latest 
>>>>>> though I still see the phantom character.  Hocr makes huge and complex 
>>>>>> output.  I'll take some time to read it.
>>>>>>
>>>>>> On Friday, July 19, 2019 at 18:20:55 UTC+9, Claudiu wrote:
>>>>>>>
>>>>>>> Is there any way to pass bounding boxes to use to the LSTM? We have 
>>>>>>> an algorithm that cleanly gets bounding boxes of MRZ characters. 
>>>>>>> However 

[tesseract-ocr] Re: Problems with training tesseract

2019-08-04 Thread ElGato ElMago
Did you specify the language option for ocrb when you ran tesseract?

On Saturday, August 3, 2019 at 0:56:25 UTC+9, Cristobal Jesus Muñoz Solano wrote:
>
> Hello, I am trying to use tesseract.  I have read all the documentation 
> and done many tests.  Sorry if this is not the place to ask this question, 
> but I have been researching for several days, I have many doubts, and I do 
> not know what to do or where to investigate.  I'm frustrated.
>
> 1) If I want to train tesseract to improve its accuracy at reading 
> images with the OCR-B font, should I first fine-tune by adding the OCR-B 
> font?  Or can I create a traineddata directly from the images/boxes and 
> then combine it with the best model?
>
> 2) How do I add many image/box pairs to the best model?
>
> 3) Once I have a .traineddata ready and saved in tessdata, is that enough 
> for tesseract to use that data when reading images?
>
> I already tried this script
> https://github.com/Shreeshrii/tessdata_ocrb
>
> but I still don't understand how to add new training images to the best 
> model
>
> please help me, I don't want to kill myself so young
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/43d67fd7-8755-460c-a3ae-64808202f420%40googlegroups.com.


[tesseract-ocr] Re: How to add dictionary to training?

2019-08-04 Thread ElGato ElMago
I believe you don't include the dictionary in training.  You only use it at 
recognition time, when you read the image.

On Saturday, August 3, 2019 at 1:48:08 UTC+9, Mox Betex wrote:
>
> I want to do fine tuning and I want to add my dictionary of words.
> How to do that, what file to create?
> Do I need to add dictionary for training or after?
>



Re: [tesseract-ocr] Use Tesseract dll with c project

2019-07-26 Thread ElGato ElMago
Like René said, I just used you as an example, too.  The problem is the 
whole situation.

I encourage you to keep going.  People will help you.  I'm not an expert in 
C or C++ either, but I imagine you can work something out from the C++ 
sample.  Then it'll be much easier and even fun.

On Friday, July 26, 2019 at 14:10:59 UTC+9, Pooja Kamra wrote:

> @ElGato and , sorry if my question bothered you both.  But the link that 
> was sent is not for the C language; it is for C++.
> I already went through that link before writing to the forum, and I only 
> posted after searching.
>
> On Friday, July 26, 2019 at 5:51:47 AM UTC+5:30, ElGato ElMago wrote:
>>
>> I feel the same.  I see many low-effort questions.  Can we do something?
>>
>> On Thursday, July 25, 2019 at 20:01:27 UTC+9, René Hansen wrote:
>>>
>>> It's *literally* one of the main items in list of wiki pages:
>>>
>>> https://github.com/tesseract-ocr/tesseract/wiki/APIExample
>>>
>>> Is it me, or is this a growing trend on this mailing list? E.g. people 
>>> just firing off emails with completely open questions instead of doing a 
>>> minimal effort to search or read the documentation.
>>>
>>> Sorry for using you as an example Pooja, I just feel like this has 
>>> become a problem and should be addressed.
>>>
>>>
>>> /René
>>>
>>>
>>> On Thu, 25 Jul 2019 at 12:22, Pooja Kamra  wrote:
>>>
>>>> OK Zdenko. But do we have some sample code to use tesseract functions 
>>>> in c application.
>>>>
>>>> On Thursday, July 25, 2019 at 1:09:02 PM UTC+5:30, zdenop wrote:
>>>>>
>>>>> I would suggest to start reading doc/wiki.
>>>>>
>>>>> Zdenko
>>>>>
>>>>>
>>>>> On Thu, 25 Jul 2019 at 8:36, Pooja Kamra wrote:
>>>>>
>>>>>> Hi,
>>>>>> I want to use libtesseract.dll in C project. In tesseract source file 
>>>>>> there is a header file capi.h.
>>>>>> How can i use these functions in c exe project.
>>>>>> Please suggest.
>>>>>>
>>>>>> -- 
>>>>>> You received this message because you are subscribed to the Google 
>>>>>> Groups "tesseract-ocr" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>> send an email to tesser...@googlegroups.com.
>>>>>> To view this discussion on the web visit 
>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/723dcea6-12bc-4ceb-a4c0-7b37e4edd1b7%40googlegroups.com
>>>>>>  
>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/723dcea6-12bc-4ceb-a4c0-7b37e4edd1b7%40googlegroups.com?utm_medium=email_source=footer>
>>>>>> .
>>>>>>
>>>>> -- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to tesser...@googlegroups.com.
>>>> To view this discussion on the web visit 
>>>> https://groups.google.com/d/msgid/tesseract-ocr/858169d5-9329-4651-8573-34e65d5841e5%40googlegroups.com
>>>>  
>>>> <https://groups.google.com/d/msgid/tesseract-ocr/858169d5-9329-4651-8573-34e65d5841e5%40googlegroups.com?utm_medium=email_source=footer>
>>>> .
>>>>
>>>
>>>
>>> -- 
>>> Never fear, Linux is here.
>>>
>>



Re: [tesseract-ocr] Trained data for E13B font

2019-07-25 Thread ElGato ElMago
Hi,

Well, I read the description of ScrollView (
https://github.com/tesseract-ocr/tesseract/wiki/ViewerDebugging) and it 
says:

To show the characters, deselect DISPLAY/Bounding Boxes, select 
DISPLAY/Polygonal Approx and then select OTHER/Uniform display.


It basically works.  But for some reason, it doesn't work on my e13b image 
and ends up with a blue screen.  Anyway, it shows each box separately when 
a character consists of multiple boxes.  I'd like to show the box for the 
whole character.  ScrollView doesn't do it, at least, yet.  I'll do it on 
my own.
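
For anyone attempting the same thing: merging the per-blob boxes into one whole-character box is just a min/max union of the corners. A minimal sketch in Python (the box tuples here are invented example data, not real hOCR output):

```python
def union_box(boxes):
    """Smallest box (x1, y1, x2, y2) enclosing all input boxes."""
    xs1, ys1, xs2, ys2 = zip(*boxes)
    return (min(xs1), min(ys1), max(xs2), max(ys2))

# An E13B digit is often split into several blobs; union them back together:
blobs = [(10, 5, 18, 40), (20, 5, 30, 22), (20, 30, 30, 40)]
print(union_box(blobs))  # -> (10, 5, 30, 40)
```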

ElMagoElGato

On Wednesday, July 24, 2019 at 14:10:46 UTC+9, ElGato ElMago wrote:
>
> Hi,
>
>
> I got this result from hocr.  This is where one of the phantom characters 
> comes from.
>
> 
> ;
>
>
> The first character is the phantom.  It starts at the same x position as 
> the second, real character and is only 3 points wide.  I attach ScrollView 
> screenshots that visualize this.
>
> [image: 2019-07-24-132643_854x707_scrot.png][image: 
> 2019-07-24-132800_854x707_scrot.png]
>
>
> There seem to be some more cases that cause phantom characters.  I'll look 
> into them.  But I have a trivial question now.  I made ScrollView show these 
> displays by accidentally clicking Display->Blamer menu.  There is Bounding 
> Boxes menu below but it ends up showing a blue screen though it briefly 
> shows boxes on the way.  Can I use this menu at all?  It'll be very useful.
>
> [image: 2019-07-24-140739_854x707_scrot.png]
>
>
> On Tuesday, July 23, 2019 at 17:10:36 UTC+9, ElGato ElMago wrote:
>>
>> It's great! Perfect!  Thanks a lot!
>>
>> On Tuesday, July 23, 2019 at 10:56:58 UTC+9, shree wrote:
>>>
>>> See https://github.com/tesseract-ocr/tesseract/issues/2580
>>>
>>> On Tue, 23 Jul 2019, 06:23 ElGato ElMago,  wrote:
>>>
>>>> Hi,
>>>>
>>>> I read the output of hocr with lstm_choice_mode = 4 as to the pull 
>>>> request 2554.  It shows the candidates for each character but doesn't show 
>>>> bounding box of each character.  It only shows the box for a whole word.
>>>>
>>>> I see bounding boxes of each character in comments of the pull request 
>>>> 2576.  How can I do that?  Do I have to look in the source code and 
>>>> manipulate such an output on my own?
>>>>
>>>> On Friday, July 19, 2019 at 18:40:49 UTC+9, ElGato ElMago wrote:
>>>>
>>>>> Lorenzo,
>>>>>
>>>>> I haven't been checking psm too much.  Will turn to those options 
>>>>> after I see how it goes with bounding boxes.
>>>>>
>>>>> Shree,
>>>>>
>>>>> I see the merges in the git log and also see that new 
>>>>> option lstm_choice_amount works now.  I guess my executable is latest 
>>>>> though I still see the phantom character.  Hocr makes huge and complex 
>>>>> output.  I'll take some time to read it.
>>>>>
>>>>> On Friday, July 19, 2019 at 18:20:55 UTC+9, Claudiu wrote:
>>>>>>
>>>>>> Is there any way to pass bounding boxes to use to the LSTM? We have 
>>>>>> an algorithm that cleanly gets bounding boxes of MRZ characters. However 
>>>>>> the results using psm 10 are worse than passing the whole line in. Yet 
>>>>>> when 
>>>>>> we pass the whole line in we get these phantom characters. 
>>>>>>
>>>>>> Should PSM 10 mode work? It often returns “no character” where there 
>>>>>> clearly is one. I can supply a test case if it is expected to work well. 
>>>>>>
>>>>>> On Fri, Jul 19, 2019 at 11:06 AM ElGato ElMago  
>>>>>> wrote:
>>>>>>
>>>>>>> Lorenzo,
>>>>>>>
>>>>>>> We both have got the same case.  It seems a solution to this problem 
>>>>>>> would save a lot of people.
>>>>>>>
>>>>>>> Shree,
>>>>>>>
>>>>>>> I pulled the current head of master branch but it doesn't seem to 
>>>>>>> contain the merges you pointed that have been merged 3 to 4 days ago.  
>>>>>>> How 
>>>>>>> can I get them?
>>>>>>>
>>>>>>> ElMagoElGato
>>>>>>>
>>>>>>> On Friday, July 19, 2019 at 17:02:53 UTC+9, Lorenzo Blz wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> PSM 7 was a partial solution for my specific c

Re: [tesseract-ocr] Use Tesseract dll with c project

2019-07-25 Thread ElGato ElMago
I feel the same.  I see many low-effort questions.  Can we do something?

On Thursday, July 25, 2019 at 20:01:27 UTC+9, René Hansen wrote:
>
> It's *literally* one of the main items in list of wiki pages:
>
> https://github.com/tesseract-ocr/tesseract/wiki/APIExample
>
> Is it me, or is this a growing trend on this mailing list? E.g. people 
> just firing off emails with completely open questions instead of doing a 
> minimal effort to search or read the documentation.
>
> Sorry for using you as an example Pooja, I just feel like this has become 
> a problem and should be addressed.
>
>
> /René
>
>
> On Thu, 25 Jul 2019 at 12:22, Pooja Kamra  > wrote:
>
>> OK Zdenko. But do we have some sample code to use tesseract functions in 
>> c application.
>>
>> On Thursday, July 25, 2019 at 1:09:02 PM UTC+5:30, zdenop wrote:
>>>
>>> I would suggest to start reading doc/wiki.
>>>
>>> Zdenko
>>>
>>>
>>> On Thu, 25 Jul 2019 at 8:36, Pooja Kamra wrote:
>>>
 Hi,
 I want to use libtesseract.dll in C project. In tesseract source file 
 there is a header file capi.h.
 How can i use these functions in c exe project.
 Please suggest.

 -- 
 You received this message because you are subscribed to the Google 
 Groups "tesseract-ocr" group.
 To unsubscribe from this group and stop receiving emails from it, send 
 an email to tesser...@googlegroups.com.
 To view this discussion on the web visit 
 https://groups.google.com/d/msgid/tesseract-ocr/723dcea6-12bc-4ceb-a4c0-7b37e4edd1b7%40googlegroups.com
  
 
 .

>>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to tesser...@googlegroups.com .
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/tesseract-ocr/858169d5-9329-4651-8573-34e65d5841e5%40googlegroups.com
>>  
>> 
>> .
>>
>
>
> -- 
> Never fear, Linux is here.
>



[tesseract-ocr] Re: Support for New Reiwa Era Character ㋿ in Japanese

2019-07-24 Thread ElGato ElMago
You can train it.

https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract

On Wednesday, July 24, 2019 at 19:49:13 UTC+9, Prateek Mehta wrote:

> There's a new character introduced ㋿ (U+32FF). Support for this character 
> is required.
>



Re: [tesseract-ocr] Trained data for E13B font

2019-07-23 Thread ElGato ElMago


Hi,


I got this result from hocr.  This is where one of the phantom characters 
comes from.


;


The first character is the phantom.  It starts at the same x position as the 
second, real character and is only 3 points wide.  I attach ScrollView 
screenshots that visualize this.

[image: 2019-07-24-132643_854x707_scrot.png][image: 
2019-07-24-132800_854x707_scrot.png]


There seem to be some more cases that cause phantom characters.  I'll look 
into them.  But I have a trivial question now.  I made ScrollView show these 
displays by accidentally clicking Display->Blamer menu.  There is Bounding 
Boxes menu below but it ends up showing a blue screen though it briefly 
shows boxes on the way.  Can I use this menu at all?  It'll be very useful.

[image: 2019-07-24-140739_854x707_scrot.png]


On Tuesday, July 23, 2019 at 17:10:36 UTC+9, ElGato ElMago wrote:
>
> It's great! Perfect!  Thanks a lot!
>
> On Tuesday, July 23, 2019 at 10:56:58 UTC+9, shree wrote:
>>
>> See https://github.com/tesseract-ocr/tesseract/issues/2580
>>
>> On Tue, 23 Jul 2019, 06:23 ElGato ElMago,  wrote:
>>
>>> Hi,
>>>
>>> I read the output of hocr with lstm_choice_mode = 4 as to the pull 
>>> request 2554.  It shows the candidates for each character but doesn't show 
>>> bounding box of each character.  It only shows the box for a whole word.
>>>
>>> I see bounding boxes of each character in comments of the pull request 
>>> 2576.  How can I do that?  Do I have to look in the source code and 
>>> manipulate such an output on my own?
>>>
>>> On Friday, July 19, 2019 at 18:40:49 UTC+9, ElGato ElMago wrote:
>>>
>>>> Lorenzo,
>>>>
>>>> I haven't been checking psm too much.  Will turn to those options after 
>>>> I see how it goes with bounding boxes.
>>>>
>>>> Shree,
>>>>
>>>> I see the merges in the git log and also see that new 
>>>> option lstm_choice_amount works now.  I guess my executable is latest 
>>>> though I still see the phantom character.  Hocr makes huge and complex 
>>>> output.  I'll take some time to read it.
>>>>
>>>> On Friday, July 19, 2019 at 18:20:55 UTC+9, Claudiu wrote:
>>>>>
>>>>> Is there any way to pass bounding boxes to use to the LSTM? We have an 
>>>>> algorithm that cleanly gets bounding boxes of MRZ characters. However the 
>>>>> results using psm 10 are worse than passing the whole line in. Yet when 
>>>>> we 
>>>>> pass the whole line in we get these phantom characters. 
>>>>>
>>>>> Should PSM 10 mode work? It often returns “no character” where there 
>>>>> clearly is one. I can supply a test case if it is expected to work well. 
>>>>>
>>>>> On Fri, Jul 19, 2019 at 11:06 AM ElGato ElMago  
>>>>> wrote:
>>>>>
>>>>>> Lorenzo,
>>>>>>
>>>>>> We both have got the same case.  It seems a solution to this problem 
>>>>>> would save a lot of people.
>>>>>>
>>>>>> Shree,
>>>>>>
>>>>>> I pulled the current head of master branch but it doesn't seem to 
>>>>>> contain the merges you pointed that have been merged 3 to 4 days ago.  
>>>>>> How 
>>>>>> can I get them?
>>>>>>
>>>>>> ElMagoElGato
>>>>>>
>>>>>> On Friday, July 19, 2019 at 17:02:53 UTC+9, Lorenzo Blz wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> PSM 7 was a partial solution for my specific case, it improved the 
>>>>>>> situation but did not solve it. Also I could not use it in some other 
>>>>>>> cases.
>>>>>>>
>>>>>>> The proper solution is very likely doing more training with more 
>>>>>>> data, some data augmentation might probably help if data is scarce.
>>>>>>> Also doing less training might help is the training is not done 
>>>>>>> correctly.
>>>>>>>
>>>>>>> There are also similar issues on github:
>>>>>>>
>>>>>>> https://github.com/tesseract-ocr/tesseract/issues/1465
>>>>>>> ...
>>>>>>>
>>>>>>> The LSTM engine works like this: it scans the image and for each 
>>>>>>> "pixel column" does this:
>>>>>>>
>>>>>>> M M M M N M M M [BLANK] F F F F
>&

Re: [tesseract-ocr] Trained data for E13B font

2019-07-23 Thread ElGato ElMago
It's great! Perfect!  Thanks a lot!

On Tuesday, July 23, 2019 at 10:56:58 UTC+9, shree wrote:
>
> See https://github.com/tesseract-ocr/tesseract/issues/2580
>
> On Tue, 23 Jul 2019, 06:23 ElGato ElMago,  > wrote:
>
>> Hi,
>>
>> I read the output of hocr with lstm_choice_mode = 4 as to the pull 
>> request 2554.  It shows the candidates for each character but doesn't show 
>> bounding box of each character.  It only shows the box for a whole word.
>>
>> I see bounding boxes of each character in comments of the pull request 
>> 2576.  How can I do that?  Do I have to look in the source code and 
>> manipulate such an output on my own?
>>
>> On Friday, July 19, 2019 at 18:40:49 UTC+9, ElGato ElMago wrote:
>>
>>> Lorenzo,
>>>
>>> I haven't been checking psm too much.  Will turn to those options after 
>>> I see how it goes with bounding boxes.
>>>
>>> Shree,
>>>
>>> I see the merges in the git log and also see that new 
>>> option lstm_choice_amount works now.  I guess my executable is latest 
>>> though I still see the phantom character.  Hocr makes huge and complex 
>>> output.  I'll take some time to read it.
>>>
>>> On Friday, July 19, 2019 at 18:20:55 UTC+9, Claudiu wrote:
>>>>
>>>> Is there any way to pass bounding boxes to use to the LSTM? We have an 
>>>> algorithm that cleanly gets bounding boxes of MRZ characters. However the 
>>>> results using psm 10 are worse than passing the whole line in. Yet when we 
>>>> pass the whole line in we get these phantom characters. 
>>>>
>>>> Should PSM 10 mode work? It often returns “no character” where there 
>>>> clearly is one. I can supply a test case if it is expected to work well. 
>>>>
>>>> On Fri, Jul 19, 2019 at 11:06 AM ElGato ElMago  
>>>> wrote:
>>>>
>>>>> Lorenzo,
>>>>>
>>>>> We both have got the same case.  It seems a solution to this problem 
>>>>> would save a lot of people.
>>>>>
>>>>> Shree,
>>>>>
>>>>> I pulled the current head of master branch but it doesn't seem to 
>>>>> contain the merges you pointed that have been merged 3 to 4 days ago.  
>>>>> How 
>>>>> can I get them?
>>>>>
>>>>> ElMagoElGato
>>>>>
>>>>> On Friday, July 19, 2019 at 17:02:53 UTC+9, Lorenzo Blz wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> PSM 7 was a partial solution for my specific case, it improved the 
>>>>>> situation but did not solve it. Also I could not use it in some other 
>>>>>> cases.
>>>>>>
>>>>>> The proper solution is very likely doing more training with more 
>>>>>> data, some data augmentation might probably help if data is scarce.
>>>>>> Also doing less training might help if the training is not done 
>>>>>> correctly.
>>>>>>
>>>>>> There are also similar issues on github:
>>>>>>
>>>>>> https://github.com/tesseract-ocr/tesseract/issues/1465
>>>>>> ...
>>>>>>
>>>>>> The LSTM engine works like this: it scans the image and for each 
>>>>>> "pixel column" does this:
>>>>>>
>>>>>> M M M M N M M M [BLANK] F F F F
>>>>>>
>>>>>> (here I report only the highest probability characters)
>>>>>>
>>>>>> In the example above an M is partially seen as an N, this is normal, 
>>>>>> and another step of the algorithm (beam search I think) tries to 
>>>>>> aggregate 
>>>>>> back the correct characters.
>>>>>>
>>>>>> I think cases like this:
>>>>>>
>>>>>> M M M N N N M M
>>>>>>
>>>>>> are what gives the phantom characters. More training should reduce 
>>>>>> the source of the problem or a painful analysis of the bounding boxes 
>>>>>> might 
>>>>>> fix some cases.
>>>>>>
>>>>>>
>>>>>> I used the attached script for the boxes.
>>>>>>
>>>>>>
>>>>>> Lorenzo
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Il giorno ven 19 lug 2019 alle ore 07:25 ElGato ElMago &

Re: [tesseract-ocr] Trained data for E13B font

2019-07-22 Thread ElGato ElMago
Hi,

I read the hocr output with lstm_choice_mode = 4, as described in pull request 
2554.  It shows the candidates for each character but doesn't show the bounding 
box of each character.  It only shows the box for a whole word.

I see bounding boxes of each character in comments of the pull request 
2576.  How can I do that?  Do I have to look in the source code and 
manipulate such an output on my own?

On Friday, July 19, 2019 at 18:40:49 UTC+9, ElGato ElMago wrote:

> Lorenzo,
>
> I haven't been checking psm too much.  Will turn to those options after I 
> see how it goes with bounding boxes.
>
> Shree,
>
> I see the merges in the git log and also see that new 
> option lstm_choice_amount works now.  I guess my executable is latest 
> though I still see the phantom character.  Hocr makes huge and complex 
> output.  I'll take some time to read it.
>
> On Friday, July 19, 2019 at 18:20:55 UTC+9, Claudiu wrote:
>>
>> Is there any way to pass bounding boxes to use to the LSTM? We have an 
>> algorithm that cleanly gets bounding boxes of MRZ characters. However the 
>> results using psm 10 are worse than passing the whole line in. Yet when we 
>> pass the whole line in we get these phantom characters. 
>>
>> Should PSM 10 mode work? It often returns “no character” where there 
>> clearly is one. I can supply a test case if it is expected to work well. 
>>
>> On Fri, Jul 19, 2019 at 11:06 AM ElGato ElMago  
>> wrote:
>>
>>> Lorenzo,
>>>
>>> We both have got the same case.  It seems a solution to this problem 
>>> would save a lot of people.
>>>
>>> Shree,
>>>
>>> I pulled the current head of master branch but it doesn't seem to 
>>> contain the merges you pointed that have been merged 3 to 4 days ago.  How 
>>> can I get them?
>>>
>>> ElMagoElGato
>>>
>>> On Friday, July 19, 2019 at 17:02:53 UTC+9, Lorenzo Blz wrote:
>>>>
>>>>
>>>>
>>>> PSM 7 was a partial solution for my specific case, it improved the 
>>>> situation but did not solve it. Also I could not use it in some other 
>>>> cases.
>>>>
>>>> The proper solution is very likely doing more training with more data, 
>>>> some data augmentation might probably help if data is scarce.
>>>> Also doing less training might help if the training is not done 
>>>> correctly.
>>>>
>>>> There are also similar issues on github:
>>>>
>>>> https://github.com/tesseract-ocr/tesseract/issues/1465
>>>> ...
>>>>
>>>> The LSTM engine works like this: it scans the image and for each "pixel 
>>>> column" does this:
>>>>
>>>> M M M M N M M M [BLANK] F F F F
>>>>
>>>> (here I report only the highest probability characters)
>>>>
>>>> In the example above an M is partially seen as an N, this is normal, 
>>>> and another step of the algorithm (beam search I think) tries to aggregate 
>>>> back the correct characters.
>>>>
>>>> I think cases like this:
>>>>
>>>> M M M N N N M M
>>>>
>>>> are what gives the phantom characters. More training should reduce the 
>>>> source of the problem or a painful analysis of the bounding boxes might 
>>>> fix 
>>>> some cases.
>>>>
>>>>
>>>> I used the attached script for the boxes.
>>>>
>>>>
>>>> Lorenzo
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, 19 Jul 2019 at 07:25, ElGato ElMago <
>>>> elmago...@gmail.com> wrote:
>>>>
>>> Hi,
>>>>>
>>>>> Let's call them phantom characters then.
>>>>>
>>>>> Was psm 7 the solution for the issue 1778?  None of the psm option 
>>>>> didn't solve my problem though I see different output.
>>>>>
>>>>> I use tesseract 5.0-alpha mostly but 4.1 showed the same results 
>>>>> anyway.  How did you get bounding box for each character?  Alto and 
>>>>> lstmbox 
>>>>> only show bbox for a group of characters.
>>>>>
>>>>> ElMagoElGato
>>>>>
>>>>> On Wednesday, July 17, 2019 at 18:58:31 UTC+9, Lorenzo Blz wrote:
>>>>>
>>>>>> Phantom characters here for me too:
>>>>>>
>>>>>> https://github.com/tesseract-ocr/tesseract/issues/1778
>>>>>>
>>>>>> Are you using 4.1? Boundin

Re: [tesseract-ocr] understading lstmeval and use it on pretrained models for comparison

2019-07-21 Thread ElGato ElMago
Yes.  This is a very good write-up and helpful to trainers.
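
As a companion to the quoted write-up, here is a small sketch that parses one of those log lines and recomputes the skip ratio. The checkpoint name simply follows the pattern described in the post (model name, char train rate, learning iteration); the model name "e13b" is made up and the exact pattern is not verified against the source.

```python
import re

line = ("At iteration 14615/695400/698614, Mean rms=0.158%, delta=0.295%, "
        "char train=1.882%, word train=2.285%, skip ratio=0.4%, wrote checkpoint.")

m = re.search(r"At iteration (\d+)/(\d+)/(\d+).*?char train=([\d.]+)%", line)
learning_it, training_it, sample_it = (int(g) for g in m.group(1, 2, 3))
char_train = float(m.group(4))

# Fraction of samples that failed to contribute a backward training step:
skip_ratio = 1 - training_it / sample_it  # ~0.0046, i.e. the logged "skip ratio"

# Checkpoint name as described in the post (hypothetical pattern and model name):
checkpoint = f"e13b{char_train}_{learning_it}.checkpoint"
print(f"{skip_ratio:.4f} {checkpoint}")
```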

On Saturday, July 20, 2019 at 0:43:56 UTC+9, shree wrote:
>
> Very well written. You may want to update the wiki pages with the info too.
>
> On Fri, Jul 19, 2019 at 7:45 PM Arno Loo > 
> wrote:
>
>> I went and tried to understand the source code as well as I could and 
>> although I did not find all the answers I did find some. (for tesseract 
>> 4.0.0-beta.3)
>> At iteration 14615/695400/698614, Mean rms=0.158%, delta=0.295%, char
>>  train=1.882%, word train=2.285%, skip ratio=0.4%,  wrote checkpoint.
>>
>> In the above example,
>> 14615 : learning_iteration
>> 695400 : training_iteration
>> 698614 : sample_iteration
>>
>> *sample_iteration* : "Index into training sample set. (sample_iteration 
>> >= training_iteration)." It is how many times a training file has been 
>> passed into the learning process
>> *training_iteration* : "Number of actual backward training steps used." 
>> It is how many times a training file has been SUCCESSFULLY passed into the 
>> learning process
>>
>> So every time you get an error: "Image too large to learn!!" - "Encoding 
>> of string failed!" - "Deserialize header failed", the sample_iteration 
>> increments but not the training_iteration.
>> Actually, 1 - (695400 / 698614) ≈ 0.46%, which is the *skip ratio*: the 
>> proportion of files that have been skipped because of an error.
>>
>> *learning_iteration* : "Number of iterations that yielded a non-zero 
>> delta error and thus provided significant learning. (learning_iteration <= 
>> training_iteration). learning_iteration_ is used to measure rate of 
>> learning progress."
>> So it uses the *delta* value to assess whether the iteration has been useful.
>>
>> What is good to know is that when you specify a maximum number of 
>> iterations to the training process, it uses the middle iteration number 
>> (training_iteration) to know when to stop. But when it writes a checkpoint, 
>> the checkpoint name uses the smallest iteration number 
>> (learning_iteration). Along with the *char train* rate. So a checkpoint 
>> name is the concatenation of model_name & char_train & learning_iteration
>>
>> --
>>
>> But there are still a lot of things I do not understand. And one of them 
>> is actually causing me an issue : even with a lot of iterations (475k) I 
>> still do not see any log message with the error on the evaluation set.
>> At iteration 61235/475300/475526, Mean rms=0.521%, delta=2.073%, char
>>  train=9.379%, word train=9.669%, skip ratio=0.1%,  New worst char error 
>> = 9.379 wrote checkpoint.
>>
>>
>>
>> Le vendredi 28 juin 2019 17:39:52 UTC+2, shree a écrit :
>>>
>>> Your best source for documentation is the source code. See
>>>
>>>
>>> https://github.com/tesseract-ocr/tesseract/blob/f522b039a52ae0094fb928ac60a66c4ae0f6c5b9/src/training/lstmtrainer.cpp#L371
>>>  
>>>
>>>
>>> https://github.com/tesseract-ocr/tesseract/blob/f522b039a52ae0094fb928ac60a66c4ae0f6c5b9/src/training/lstmtrainer.cpp#L382
>>>  
>>>
>>> On Fri, Jun 28, 2019 at 8:47 PM Arno Loo  wrote:
>>>
 I continue to experiment and try to understand what seems 
 important, and I have a few questions after researching Tesseract's wiki.

 During the training we can see this kind of information :
 At iteration 100/100/100, Mean rms=4.514%, delta=19.089%, char train=
 96.314%, word train=100%, skip ratio=0%,  New best char error = 96.314 
 wrote checkpoint.

 - *100/100/100:* What do these 3 numbers at the beginning mean when 
 they are different?  (They often are, unlike in my example.)
 - *Mean rms* I know well, it's the Root Mean Square error. But what 
 error metric is used? Usually it is some kind of distance; the Levenshtein 
 distance is often appropriate for OCR tasks, but the "%" wouldn't be there 
 if it was.
 - *delta* I don't know
 - *char train *must be the percentage of wrong character predictions 
 during the *training*
 - *word train *must be the percentage of wrong word predictions during 
 the *training*
 - *skip ratio* is, I think, the percentage of samples skipped for any 
 reason (invalid data or something)

 Can anyone help me understand them, please?

 Also, I do not see any error on the evaluation set during training, which 
 would be really helpful to avoid overfitting. The only way I know to 
 follow the *evaluation* error during training would be to run lstmeval on 
 each checkpoint, but I think there must be a better way? Otherwise the 
 *--eval_listfile* argument to lstmtraining would be useless, but I can't 
 find out how it is used.

 Thank you :)

 Le jeudi 27 juin 2019 19:17:46 UTC+2, shree a écrit :
>
> See 
> https://github.com/tesseract-ocr/tesseract/blob/master/doc/lstmeval.1.asc
>
> When using checkpoint you need to also use the starter traineddata 
> file used for training.
>
> Or 

Re: [tesseract-ocr] Trained data for E13B font

2019-07-19 Thread ElGato ElMago
Lorenzo,

I haven't been checking psm too much.  Will turn to those options after I 
see how it goes with bounding boxes.

Shree,

I see the merges in the git log and also see that new 
option lstm_choice_amount works now.  I guess my executable is the latest, 
though I still see the phantom character.  Hocr makes huge and complex 
output.  I'll take some time to read it.
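
Lorenzo's per-column picture can be sketched with a greedy CTC-style collapse (drop blanks, merge consecutive repeats). Under greedy decoding even a single stray column produces a phantom, which the real beam search can often recover from the underlying probabilities, but a sustained run of wrong picks survives either way. A toy illustration:

```python
def collapse(columns, blank="[BLANK]"):
    """Greedy CTC-style decode: drop blanks and merge consecutive repeats."""
    out, prev = [], None
    for c in columns:
        if c != blank and c != prev:
            out.append(c)
        prev = c
    return "".join(out)

# One stray N column: greedy decode already shows a phantom N; the beam
# search can often rescue this case using the full probabilities.
print(collapse(["M", "M", "M", "M", "N", "M", "M", "M", "[BLANK]", "F", "F"]))  # -> MNMF

# A sustained run of N columns collapses to "MNM": the phantom N survives.
print(collapse(["M", "M", "M", "N", "N", "N", "M", "M"]))  # -> MNM
```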

On Friday, July 19, 2019 at 18:20:55 UTC+9, Claudiu wrote:
>
> Is there any way to pass bounding boxes to use to the LSTM? We have an 
> algorithm that cleanly gets bounding boxes of MRZ characters. However the 
> results using psm 10 are worse than passing the whole line in. Yet when we 
> pass the whole line in we get these phantom characters. 
>
> Should PSM 10 mode work? It often returns “no character” where there 
> clearly is one. I can supply a test case if it is expected to work well. 
>
> On Fri, Jul 19, 2019 at 11:06 AM ElGato ElMago  > wrote:
>
>> Lorenzo,
>>
>> We both have got the same case.  It seems a solution to this problem 
>> would save a lot of people.
>>
>> Shree,
>>
>> I pulled the current head of master branch but it doesn't seem to contain 
>> the merges you pointed that have been merged 3 to 4 days ago.  How can I 
>> get them?
>>
>> ElMagoElGato
>>
>> On Friday, July 19, 2019 at 17:02:53 UTC+9, Lorenzo Blz wrote:
>>>
>>>
>>>
>>> PSM 7 was a partial solution for my specific case, it improved the 
>>> situation but did not solve it. Also I could not use it in some other cases.
>>>
>>> The proper solution is very likely doing more training with more data, 
>>> some data augmentation might probably help if data is scarce.
>>> Also, doing less training might help if the training is not done 
>>> correctly.
>>>
>>> There are also similar issues on github:
>>>
>>> https://github.com/tesseract-ocr/tesseract/issues/1465
>>> ...
>>>
>>> The LSTM engine works like this: it scans the image and for each "pixel 
>>> column" does this:
>>>
>>> M M M M N M M M [BLANK] F F F F
>>>
>>> (here i report only the highest probability characters)
>>>
>>> In the example above an M is partially seen as an N, this is normal, and 
>>> another step of the algorithm (beam search I think) tries to aggregate back 
>>> the correct characters.
>>>
>>> I think cases like this:
>>>
>>> M M M N N N M M
>>>
>>> are what gives the phantom characters. More training should reduce the 
>>> source of the problem or a painful analysis of the bounding boxes might fix 
>>> some cases.
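Lorenzo's description of per-column emissions being aggregated back into characters is essentially CTC-style decoding. A minimal greedy sketch (an illustration only; the real engine runs beam search over full per-column probability distributions, not just the top symbol) shows how an interrupted run like `M M M N N N M M` yields a phantom character:

```python
BLANK = "_"  # stand-in for the network's "no character" symbol

def collapse(frames):
    """Greedy CTC-style decoding: merge runs of identical symbols,
    then drop the blanks."""
    out = []
    prev = None
    for f in frames:
        if f != prev and f != BLANK:
            out.append(f)
        prev = f
    return "".join(out)

# A brief N inside a run of M splits the run, so a phantom N survives:
print(collapse(["M", "M", "M", "M", "N", "M", "M", "M", BLANK, "F"]))  # MNMF
print(collapse(["M", "M", "M", "N", "N", "N", "M", "M"]))              # MNM
```

Beam search can often recover the intended reading because the N columns also carry probability mass for M, which is why more training (sharper per-column distributions) reduces the phantoms.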
>>>
>>>
>>> I used the attached script for the boxes.
>>>
>>>
>>> Lorenzo
>>>
>>>
>>>
>>>
>>> Il giorno ven 19 lug 2019 alle ore 07:25 ElGato ElMago <
>>> elmago...@gmail.com> ha scritto:
>>>
>> Hi,
>>>>
>>>> Let's call them phantom characters then.
>>>>
>>>> Was psm 7 the solution for issue 1778?  None of the psm options 
>>>> solved my problem, though I see different output.
>>>>
>>>> I use tesseract 5.0-alpha mostly but 4.1 showed the same results 
>>>> anyway.  How did you get bounding box for each character?  Alto and 
>>>> lstmbox 
>>>> only show bbox for a group of characters.
>>>>
>>>> ElMagoElGato
>>>>
>>>> 2019年7月17日水曜日 18時58分31秒 UTC+9 Lorenzo Blz:
>>>>
>>>>> Phantom characters here for me too:
>>>>>
>>>>> https://github.com/tesseract-ocr/tesseract/issues/1778
>>>>>
>>>>> Are you using 4.1? Bounding boxes were fixed in 4.1 maybe this was 
>>>>> also improved.
>>>>>
>>>>> I wrote some code that uses symbols iterator to discard symbols that 
>>>>> are clearly duplicated: too small, overlapping, etc. But it was not easy 
>>>>> to 
>>>>> make it work decently and it is not 100% reliable with false negatives 
>>>>> and 
>>>>> positives. I cannot share the code and it is quite ugly anyway.
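A filter like the one Lorenzo describes can be sketched independently of the Tesseract API, operating on already-extracted (character, bounding box, confidence) triples; the width and overlap thresholds below are made-up illustrative values, not his actual ones:

```python
def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x1 - x0) * max(0, y1 - y0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def drop_phantoms(symbols, min_width=4, max_overlap=0.7):
    """symbols: list of (char, (x0, y0, x1, y1), confidence).
    Discard symbols that are suspiciously thin, and of any pair of
    heavily overlapping boxes keep only the higher-confidence one."""
    kept = []
    for ch, box, conf in sorted(symbols, key=lambda s: -s[2]):
        if box[2] - box[0] < min_width:
            continue  # too thin to be a real character
        if any(iou(box, kb) > max_overlap for _, kb, _ in kept):
            continue  # duplicate of an already-kept symbol
        kept.append((ch, box, conf))
    kept.sort(key=lambda s: s[1][0])  # restore left-to-right order
    return kept

# A low-confidence "O" sharing its box with a high-confidence "0" is dropped:
syms = [("0", (10, 0, 20, 14), 96), ("O", (10, 0, 19, 14), 41),
        ("4", (22, 0, 30, 14), 90)]
print([s[0] for s in drop_phantoms(syms)])  # ['0', '4']
```

As noted, tuning such thresholds reliably is hard; false negatives and positives remain.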
>>>>>
>>>>> Here there is another MRZ model with training data:
>>>>>
>>>>> https://github.com/DoubangoTelecom/tesseractMRZ
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Lorenzo
>>>>>
>>>>>
>>>>> Il giorno mer 17 lug 2019 alle or

Re: [tesseract-ocr] Trained data for E13B font

2019-07-19 Thread ElGato ElMago
Lorenzo,

We both have got the same case.  It seems a solution to this problem would 
save a lot of people.

Shree,

I pulled the current head of master branch but it doesn't seem to contain 
the merges you pointed that have been merged 3 to 4 days ago.  How can I 
get them?

ElMagoElGato

2019年7月19日金曜日 17時02分53秒 UTC+9 Lorenzo Blz:
>
>
>
> PSM 7 was a partial solution for my specific case, it improved the 
> situation but did not solve it. Also I could not use it in some other cases.
>
> The proper solution is very likely doing more training with more data, 
> some data augmentation might probably help if data is scarce.
> Also, doing less training might help if the training is not done correctly.
>
> There are also similar issues on github:
>
> https://github.com/tesseract-ocr/tesseract/issues/1465
> ...
>
> The LSTM engine works like this: it scans the image and for each "pixel 
> column" does this:
>
> M M M M N M M M [BLANK] F F F F
>
> (here i report only the highest probability characters)
>
> In the example above an M is partially seen as an N, this is normal, and 
> another step of the algorithm (beam search I think) tries to aggregate back 
> the correct characters.
>
> I think cases like this:
>
> M M M N N N M M
>
> are what gives the phantom characters. More training should reduce the 
> source of the problem or a painful analysis of the bounding boxes might fix 
> some cases.
>
>
> I used the attached script for the boxes.
>
>
> Lorenzo
>
>
>
>
> Il giorno ven 19 lug 2019 alle ore 07:25 ElGato ElMago <
> elmago...@gmail.com > ha scritto:
>
>> Hi,
>>
>> Let's call them phantom characters then.
>>
>> Was psm 7 the solution for issue 1778?  None of the psm options 
>> solved my problem, though I see different output.
>>
>> I use tesseract 5.0-alpha mostly but 4.1 showed the same results anyway.  
>> How did you get bounding box for each character?  Alto and lstmbox 
>> only show bbox for a group of characters.
>>
>> ElMagoElGato
>>
>> 2019年7月17日水曜日 18時58分31秒 UTC+9 Lorenzo Blz:
>>
>>> Phantom characters here for me too:
>>>
>>> https://github.com/tesseract-ocr/tesseract/issues/1778
>>>
>>> Are you using 4.1? Bounding boxes were fixed in 4.1 maybe this was also 
>>> improved.
>>>
>>> I wrote some code that uses symbols iterator to discard symbols that are 
>>> clearly duplicated: too small, overlapping, etc. But it was not easy to 
>>> make it work decently and it is not 100% reliable with false negatives and 
>>> positives. I cannot share the code and it is quite ugly anyway.
>>>
>>> Here there is another MRZ model with training data:
>>>
>>> https://github.com/DoubangoTelecom/tesseractMRZ
>>>
>>>
>>>
>>>
>>> Lorenzo
>>>
>>>
>>> Il giorno mer 17 lug 2019 alle ore 11:26 Claudiu  ha 
>>> scritto:
>>>
>>>> I’m getting the “phantom character” issue as well using the OCRB that 
>>>> Shree trained on MRZ lines. For example for a 0 it will sometimes add both 
>>>> a 0 and an O to the output , thus outputting 45 characters total instead 
>>>> of 
>>>> 44. I haven’t looked at the bounding box output yet but I suspect a 
>>>> phantom 
>>>> thin character is added somewhere that I can discard .. or maybe two chars 
>>>> will have the same bounding box. If anyone else has fixed this issue 
>>>> further up (eg so the output doesn’t contain the phantom characters in the 
>>>> first place) I'd be interested. 
>>>>
>>>> On Wed, Jul 17, 2019 at 10:01 AM ElGato ElMago  
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'll go back to more of training later.  Before doing so, I'd like to 
>>>>> investigate results a little bit.  The hocr and lstmbox options give some 
>>>>> details of positions of characters.  The results show positions that 
>>>>> perfectly correspond to letters in the image.  But the text output 
>>>>> contains 
>>>>> a character that obviously does not exist.
>>>>>
>>>>> Then I found a config file 'lstmdebug' that generates far more 
>>>>> information.  I hope it explains what happened with each character.  I'm 
>>>>> yet to read the debug output but I'd appreciate it if someone could tell 
>>>>> me 
>>>>> how to read it because it's really complex.
>>

Re: [tesseract-ocr] Trained data for E13B font

2019-07-18 Thread ElGato ElMago
Hi,

Let's call them phantom characters then.

Was psm 7 the solution for issue 1778?  None of the psm options 
solved my problem, though I see different output.

I use tesseract 5.0-alpha mostly but 4.1 showed the same results anyway.  
How did you get bounding box for each character?  Alto and lstmbox 
only show bbox for a group of characters.

ElMagoElGato

2019年7月17日水曜日 18時58分31秒 UTC+9 Lorenzo Blz:

> Phantom characters here for me too:
>
> https://github.com/tesseract-ocr/tesseract/issues/1778
>
> Are you using 4.1? Bounding boxes were fixed in 4.1 maybe this was also 
> improved.
>
> I wrote some code that uses symbols iterator to discard symbols that are 
> clearly duplicated: too small, overlapping, etc. But it was not easy to 
> make it work decently and it is not 100% reliable with false negatives and 
> positives. I cannot share the code and it is quite ugly anyway.
>
> Here there is another MRZ model with training data:
>
> https://github.com/DoubangoTelecom/tesseractMRZ
>
>
>
>
> Lorenzo
>
>
> Il giorno mer 17 lug 2019 alle ore 11:26 Claudiu  > ha scritto:
>
>> I’m getting the “phantom character” issue as well using the OCRB that 
>> Shree trained on MRZ lines. For example for a 0 it will sometimes add both 
>> a 0 and an O to the output , thus outputting 45 characters total instead of 
>> 44. I haven’t looked at the bounding box output yet but I suspect a phantom 
>> thin character is added somewhere that I can discard .. or maybe two chars 
>> will have the same bounding box. If anyone else has fixed this issue 
>> further up (eg so the output doesn’t contain the phantom characters in the 
>> first place) I'd be interested. 
>>
>> On Wed, Jul 17, 2019 at 10:01 AM ElGato ElMago > > wrote:
>>
>>> Hi,
>>>
>>> I'll go back to more of training later.  Before doing so, I'd like to 
>>> investigate results a little bit.  The hocr and lstmbox options give some 
>>> details of positions of characters.  The results show positions that 
>>> perfectly correspond to letters in the image.  But the text output contains 
>>> a character that obviously does not exist.
>>>
>>> Then I found a config file 'lstmdebug' that generates far more 
>>> information.  I hope it explains what happened with each character.  I'm 
>>> yet to read the debug output but I'd appreciate it if someone could tell me 
>>> how to read it because it's really complex.
>>>
>>> Regards,
>>> ElMagoElGato
>>>
>>> 2019年6月14日金曜日 19時58分49秒 UTC+9 shree:
>>>
>>>> See https://github.com/Shreeshrii/tessdata_MICR
>>>>
>>>> I have uploaded my files there. 
>>>>
>>>> https://github.com/Shreeshrii/tessdata_MICR/blob/master/MICR.sh
>>>> is the bash script that runs the training.
>>>>
>>>> You can modify as needed. Please note this is for legacy/base tesseract 
>>>> --oem 0.
>>>>
>>>> On Fri, Jun 14, 2019 at 1:26 PM ElGato ElMago  
>>>> wrote:
>>>>
>>>>> Thanks a lot, shree.  It seems you know everything.
>>>>>
>>>>> I tried the MICR0.traineddata and the first two mcr.traineddata.  The 
>>>>> last one was blocked by the browser.  Each of the traineddata had mixed 
>>>>> results.  All of them read the symbols fairly well but insert spaces 
>>>>> randomly and read some numbers wrong.
>>>>>
>>>>> MICR0 seems the best among them.  Did you suggest that you'd be able 
>>>>> to update it?  It gets triple D very often where there's only one, and 
>>>>> so 
>>>>> on.
>>>>>
>>>>> Also, I tried to fine tune from MICR0 but I found that I need to 
>>>>> change the language-specific.sh.  It specifies some parameters for each 
>>>>> language.  Do you have any guidance for it?
>>>>>
>>>>> 2019年6月14日金曜日 1時48分40秒 UTC+9 shree:
>>>>>>
>>>>>> see 
>>>>>> http://www.devscope.net/Content/ocrchecks.aspx 
>>>>>> https://github.com/BigPino67/Tesseract-MICR-OCR
>>>>>> https://groups.google.com/d/msg/tesseract-ocr/obWI4cz8rXg/6l82hEySgOgJ
>>>>>>  
>>>>>>
>>>>>> On Mon, Jun 10, 2019 at 11:21 AM ElGato ElMago  
>>>>>> wrote:
>>>>>>
>>>>>>> That'll be nice if there's traineddata out there but I didn't find 
>>>>>>> any.  I see free fonts 

Re: [tesseract-ocr] Trained data for E13B font

2019-07-17 Thread ElGato ElMago
Hi,

I'll go back to more of training later.  Before doing so, I'd like to 
investigate results a little bit.  The hocr and lstmbox options give some 
details of positions of characters.  The results show positions that 
perfectly correspond to letters in the image.  But the text output contains 
a character that obviously does not exist.

Then I found a config file 'lstmdebug' that generates far more 
information.  I hope it explains what happened with each character.  I'm 
yet to read the debug output but I'd appreciate it if someone could tell me 
how to read it because it's really complex.

Regards,
ElMagoElGato

2019年6月14日金曜日 19時58分49秒 UTC+9 shree:

> See https://github.com/Shreeshrii/tessdata_MICR
>
> I have uploaded my files there. 
>
> https://github.com/Shreeshrii/tessdata_MICR/blob/master/MICR.sh
> is the bash script that runs the training.
>
> You can modify as needed. Please note this is for legacy/base tesseract 
> --oem 0.
>
> On Fri, Jun 14, 2019 at 1:26 PM ElGato ElMago  > wrote:
>
>> Thanks a lot, shree.  It seems you know everything.
>>
>> I tried the MICR0.traineddata and the first two mcr.traineddata.  The 
>> last one was blocked by the browser.  Each of the traineddata had mixed 
>> results.  All of them read the symbols fairly well but insert spaces 
>> randomly and read some numbers wrong.
>>
>> MICR0 seems the best among them.  Did you suggest that you'd be able to 
>> update it?  It gets triple D very often where there's only one, and so on.
>>
>> Also, I tried to fine tune from MICR0 but I found that I need to change 
>> the language-specific.sh.  It specifies some parameters for each language.  
>> Do you have any guidance for it?
>>
>> 2019年6月14日金曜日 1時48分40秒 UTC+9 shree:
>>>
>>> see 
>>> http://www.devscope.net/Content/ocrchecks.aspx 
>>> https://github.com/BigPino67/Tesseract-MICR-OCR
>>> https://groups.google.com/d/msg/tesseract-ocr/obWI4cz8rXg/6l82hEySgOgJ 
>>>
>>> On Mon, Jun 10, 2019 at 11:21 AM ElGato ElMago  
>>> wrote:
>>>
>>>> That'll be nice if there's traineddata out there but I didn't find 
>>>> any.  I see free fonts and commercial OCR software but not traineddata.  
>>>> Tessdata repository obviously doesn't have one, either.
>>>>
>>>> 2019年6月8日土曜日 1時52分10秒 UTC+9 shree:
>>>>>
>>>>> Please also search for existing MICR traineddata files.
>>>>>
>>>>> On Thu, Jun 6, 2019 at 1:09 PM ElGato ElMago  
>>>>> wrote:
>>>>>
>>>>>> So I did several tests from scratch.  In the last attempt, I made a 
>>>>>> training text with 4,000 lines in the following format,
>>>>>>
>>>>>> 110004310510<   <02 :4002=0181:801= 0008752 <00039 ;001000;
>>>>>>
>>>>>>
>>>>>> and combined it with eng.digits.training_text in which symbols are 
>>>>>> converted to E13B symbols.  This makes about 12,000 lines of training 
>>>>>> text.  It's amazing that this thing generates a good reader out of 
>>>>>> nowhere.  But then it is not very good.  For example:
>>>>>>
>>>>>> <01 :1901=1386:021= 001<10001< ;090134;
>>>>>>
>>>>>> is a result on the image attached.  It's close but the last '<' in 
>>>>>> the result text doesn't exist on the image.  It's a small failure but it 
>>>>>> causes a greater trouble in parsing.
>>>>>>
>>>>>> What would you suggest from here to increase accuracy?  
>>>>>>
>>>>>>- Increase the number of lines in the training text
>>>>>>- Mix up more variations in the training text
>>>>>>- Increase the number of iterations
>>>>>>- Investigate wrong reads one by one
>>>>>>- Or else?
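For the first two options, a small generator can produce arbitrarily many varied training lines. This sketch uses the ASCII stand-ins for the four E13B symbols that appear above (`<`, `:`, `=`, `;`); the exact field layout and widths are illustrative assumptions, not the poster's actual script:

```python
import random

SYMS = "<:=;"  # ASCII stand-ins for the four E13B symbols
random.seed(7)  # reproducible output

def digits(n):
    return "".join(random.choice("0123456789") for _ in range(n))

def micr_line():
    """One synthetic MICR-style line mixing digit runs and symbols,
    with varied field widths so the model sees many contexts."""
    fields = [
        digits(random.randint(6, 12)) + random.choice(SYMS),
        random.choice(SYMS) + digits(2),
        ":" + digits(4) + "=" + digits(4) + ":" + digits(3) + "=",
        digits(random.randint(5, 9)),
        ";" + digits(6) + ";",
    ]
    return " ".join(fields)

lines = [micr_line() for _ in range(4000)]
```

Mixing several such layouts (and shuffling the field order) is a cheap way to get the extra variation the list above asks about.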
>>>>>>
>>>>>> Also, I referred to engrestrict*.* and could generate similar result 
>>>>>> with the fine-tuning-from-full method.  It seems a bit faster to get to 
>>>>>> the 
>>>>>> same level but it also stops at a 'good' level.  I can go with either 
>>>>>> way 
>>>>>> if it takes me to the bright future.
>>>>>>
>>>>>> Regards,
>>>>>> ElMagoElGato
>>>>>>
>>>>>> 2019年5月30日木曜日 15時56分02秒 UTC+9 ElGato ElMago:
>>>>>>>
>>>>&g

[tesseract-ocr] Re: Suggest a method to improve tesseract results

2019-06-19 Thread ElGato ElMago
Does it have to be distorted like that? It's amazing that a human being can 
take it as an S. Is a neural network ever capable of doing the same thing?

If I and l do not take the same shape, I'd think of a dictionary or 
post-processing to switch them around.
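That post-processing idea can be sketched as two context rules (illustrative heuristics only, not a tested solution):

```python
import re

def fix_il(text):
    """Heuristic swap: an 'I' between lowercase letters is almost
    certainly 'l'; an 'l' between capitals is almost certainly 'I'."""
    text = re.sub(r"(?<=[a-z])I(?=[a-z])", "l", text)
    text = re.sub(r"(?<=[A-Z])l(?=[A-Z])", "I", text)
    return text

print(fix_il("heIlo"))  # hello
print(fix_il("WORlD"))  # WORID
```

A dictionary lookup on the corrected candidates would make the swap safer than context rules alone.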

2019年6月19日水曜日 20時37分18秒 UTC+9 hrishikesh kaulwar:

> Dear all,
>  In the above image tesseract could not detect the first letter S, 
> which is important for my purpose. Also, there are a few cases where I (capital 
> i) and l (small L) are detected wrongly. What training or method can I use 
> to improve tesseract results in such cases?
>  Thanks in advance.
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/710afbd2-ffe5-4e93-b975-029d2b15b27b%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


[tesseract-ocr] Re: OCR pipeline with OpenCV

2019-06-18 Thread ElGato ElMago
Those images and fonts obviously are not ready for OCR.  You need to improve 
the images and train on the font.

Do you only need to read temperatures?  Then some pattern-recognition 
method in OpenCV might be easier to work with.
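One such method is template matching (OpenCV provides it as cv2.matchTemplate); here is a dependency-light sketch of the underlying normalized cross-correlation, run on a toy array rather than a real display image, assuming one reference template per digit:

```python
import numpy as np

def match_template(image, template):
    """Slide the template over the image; return the top-left (x, y)
    with the highest normalized cross-correlation score."""
    ih, iw = image.shape
    th, tw = template.shape
    t = template - template.mean()
    best_score, best_pos = -1.0, (0, 0)
    for y in range(ih - th + 1):
        for x in range(iw - tw + 1):
            w = image[y:y + th, x:x + tw]
            wc = w - w.mean()
            denom = np.sqrt((wc ** 2).sum() * (t ** 2).sum())
            if denom == 0:
                continue  # flat window: correlation undefined
            score = (wc * t).sum() / denom
            if score > best_score:
                best_score, best_pos = score, (x, y)
    return best_pos, best_score

# Toy example: find a 2x2 diagonal pattern hidden in an 8x8 image.
img = np.zeros((8, 8))
img[3:5, 5:7] = [[1.0, 0.0], [0.0, 1.0]]
tmpl = np.array([[1.0, 0.0], [0.0, 1.0]])
pos, score = match_template(img, tmpl)
print(pos, round(score, 3))  # (5, 3) 1.0
```

With one template per digit cut from a clean frame, taking the best-scoring template at each character position reads the display; for a fixed seven-segment font this is often more robust than general OCR.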

2019年6月19日水曜日 7時16分21秒 UTC+9 Mox Betex:

> Did you train Tesseract?
>
> Image is of poor quality for OCR, you have to improve it.
> Also check the resolution of image.
>
>



[tesseract-ocr] Re: Multiline tiff/txt

2019-06-18 Thread ElGato ElMago
To read with tesseract? Why not?

2019年6月18日火曜日 19時11分23秒 UTC+9 Mox Betex:
>
> Can I use multiline tiff/txt files instead of single line tiff/txt?
>



Re: [tesseract-ocr] Trained data for E13B font

2019-06-17 Thread ElGato ElMago
I guess the content of the training text is important when you add new 
characters.  I had the same issue at first, and then shree suggested 
a larger text and more iterations.  I thought variation in the text would 
matter as well.  I'm getting good results now that I have prepared a good 
training text.

Now, both training from scratch and fine tuning are giving decent results.  
I'm working on the E13B font, which the existing eng.traineddata never reads.  
It proves the training really works.  My issue is to bring the accuracy to a 
higher level.  I'm yet to try the last suggestion from shree, but I know 
that it'll be a long way to go for extreme accuracy.

2019年6月17日月曜日 13時40分10秒 UTC+9 Phuc:

> Sorry if I interrupted your conversation.
> I have a similar problem which is the .traineddata I exported from 
> checkpoint file did not recognize any character at all although my training 
> showed very good results.
> As I understand from your conversation, is this because of training from 
> scratch? Is all I need to do fine-tuning a model to get a better result?
> Also, I am quite confused why the result using the checkpoint file is so 
> different from .traineddata, and I would appreciate it if someone could 
> explain the reason why.
>
> To have more information about my case, you can refer my post here: 
> https://groups.google.com/forum/?utm_medium=email_source=footer#!topic/tesseract-ocr/74xMXlYX6T0
> Thank you and have a nice day
>
> On Friday, June 14, 2019 at 7:58:49 PM UTC+9, shree wrote:
>>
>> See https://github.com/Shreeshrii/tessdata_MICR
>>
>> I have uploaded my files there. 
>>
>> https://github.com/Shreeshrii/tessdata_MICR/blob/master/MICR.sh
>> is the bash script that runs the training.
>>
>> You can modify as needed. Please note this is for legacy/base tesseract 
>> --oem 0.
>>
>> On Fri, Jun 14, 2019 at 1:26 PM ElGato ElMago  
>> wrote:
>>
>>> Thanks a lot, shree.  It seems you know everything.
>>>
>>> I tried the MICR0.traineddata and the first two mcr.traineddata.  The 
>>> last one was blocked by the browser.  Each of the traineddata had mixed 
>>> results.  All of them read the symbols fairly well but insert spaces 
>>> randomly and read some numbers wrong.
>>>
>>> MICR0 seems the best among them.  Did you suggest that you'd be able to 
>>> update it?  It gets triple D very often where there's only one, and so on.
>>>
>>> Also, I tried to fine tune from MICR0 but I found that I need to change 
>>> the language-specific.sh.  It specifies some parameters for each language.  
>>> Do you have any guidance for it?
>>>
>>> 2019年6月14日金曜日 1時48分40秒 UTC+9 shree:
>>>>
>>>> see 
>>>> http://www.devscope.net/Content/ocrchecks.aspx 
>>>> https://github.com/BigPino67/Tesseract-MICR-OCR
>>>> https://groups.google.com/d/msg/tesseract-ocr/obWI4cz8rXg/6l82hEySgOgJ 
>>>>
>>>> On Mon, Jun 10, 2019 at 11:21 AM ElGato ElMago  
>>>> wrote:
>>>>
>>>>> That'll be nice if there's traineddata out there but I didn't find 
>>>>> any.  I see free fonts and commercial OCR software but not traineddata.  
>>>>> Tessdata repository obviously doesn't have one, either.
>>>>>
>>>>> 2019年6月8日土曜日 1時52分10秒 UTC+9 shree:
>>>>>>
>>>>>> Please also search for existing MICR traineddata files.
>>>>>>
>>>>>> On Thu, Jun 6, 2019 at 1:09 PM ElGato ElMago  
>>>>>> wrote:
>>>>>>
>>>>>>> So I did several tests from scratch.  In the last attempt, I made a 
>>>>>>> training text with 4,000 lines in the following format,
>>>>>>>
>>>>>>> 110004310510<   <02 :4002=0181:801= 0008752 <00039 ;001000;
>>>>>>>
>>>>>>>
>>>>>>> and combined it with eng.digits.training_text in which symbols are 
>>>>>>> converted to E13B symbols.  This makes about 12,000 lines of training 
>>>>>>> text.  It's amazing that this thing generates a good reader out of 
>>>>>>> nowhere.  But then it is not very good.  For example:
>>>>>>>
>>>>>>> <01 :1901=1386:021= 001<10001< ;090134;
>>>>>>>
>>>>>>> is a result on the image attached.  It's close but the last '<' in 
>>>>>>> the result text doesn't exist on the image.  It's a small failure but 
>>>>>>> it 
>>>>>>>

[tesseract-ocr] Re: Training on cloud

2019-06-17 Thread ElGato ElMago
A Raspberry Pi 3B is enough for me.  It takes 1 to 2 days depending on the 
training.

2019年6月18日火曜日 7時50分04秒 UTC+9 Mox Betex:
>
> I was thinking of paying for Dedicated Server on 
>> https://www.germanvps.com/hg-linux-kvm-hosting.php to train data.
>>
>
> Can someone tell me is this server enough to train data fast? How long can 
> training last with this specification?
>
>- 8 Core Intel Xeon 2.60GHz, 32GB DDR4
>
>



Re: [tesseract-ocr] Trained data for E13B font

2019-06-14 Thread ElGato ElMago
Thanks a lot, shree.  It seems you know everything.

I tried the MICR0.traineddata and the first two mcr.traineddata.  The last 
one was blocked by the browser.  Each of the traineddata had mixed 
results.  All of them read the symbols fairly well but insert spaces 
randomly and read some numbers wrong.

MICR0 seems the best among them.  Did you suggest that you'd be able to 
update it?  It gets triple D very often where there's only one, and so on.

Also, I tried to fine tune from MICR0 but I found that I need to change the 
language-specific.sh.  It specifies some parameters for each language.  Do 
you have any guidance for it?

2019年6月14日金曜日 1時48分40秒 UTC+9 shree:
>
> see 
> http://www.devscope.net/Content/ocrchecks.aspx 
> https://github.com/BigPino67/Tesseract-MICR-OCR
> https://groups.google.com/d/msg/tesseract-ocr/obWI4cz8rXg/6l82hEySgOgJ 
>
> On Mon, Jun 10, 2019 at 11:21 AM ElGato ElMago  > wrote:
>
>> That'll be nice if there's traineddata out there but I didn't find any.  
>> I see free fonts and commercial OCR software but not traineddata.  Tessdata 
>> repository obviously doesn't have one, either.
>>
>> 2019年6月8日土曜日 1時52分10秒 UTC+9 shree:
>>>
>>> Please also search for existing MICR traineddata files.
>>>
>>> On Thu, Jun 6, 2019 at 1:09 PM ElGato ElMago  
>>> wrote:
>>>
>>>> So I did several tests from scratch.  In the last attempt, I made a 
>>>> training text with 4,000 lines in the following format,
>>>>
>>>> 110004310510<   <02 :4002=0181:801= 0008752 <00039 ;001000;
>>>>
>>>>
>>>> and combined it with eng.digits.training_text in which symbols are 
>>>> converted to E13B symbols.  This makes about 12,000 lines of training 
>>>> text.  It's amazing that this thing generates a good reader out of 
>>>> nowhere.  But then it is not very good.  For example:
>>>>
>>>> <01 :1901=1386:021= 001<10001< ;090134;
>>>>
>>>> is a result on the image attached.  It's close but the last '<' in the 
>>>> result text doesn't exist on the image.  It's a small failure but it 
>>>> causes 
>>>> a greater trouble in parsing.
>>>>
>>>> What would you suggest from here to increase accuracy?  
>>>>
>>>>- Increase the number of lines in the training text
>>>>- Mix up more variations in the training text
>>>>- Increase the number of iterations
>>>>- Investigate wrong reads one by one
>>>>- Or else?
>>>>
>>>> Also, I referred to engrestrict*.* and could generate similar result 
>>>> with the fine-tuning-from-full method.  It seems a bit faster to get to 
>>>> the 
>>>> same level but it also stops at a 'good' level.  I can go with either way 
>>>> if it takes me to the bright future.
>>>>
>>>> Regards,
>>>> ElMagoElGato
>>>>
>>>> 2019年5月30日木曜日 15時56分02秒 UTC+9 ElGato ElMago:
>>>>>
>>>>> Thanks a lot, Shree. I'll look it in.
>>>>>
>>>>> 2019年5月30日木曜日 14時39分52秒 UTC+9 shree:
>>>>>>
>>>>>> See https://github.com/Shreeshrii/tessdata_shreetest
>>>>>>
>>>>>> Look at the files engrestrict*.* and also 
>>>>>> https://github.com/Shreeshrii/tessdata_shreetest/blob/master/eng.digits.training_text
>>>>>>
>>>>>> Create training text of about 100 lines and finetune for 400 lines 
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, May 30, 2019 at 9:38 AM ElGato ElMago  
>>>>>> wrote:
>>>>>>
>>>>>>> I had about 14 lines as attached.  How many lines would you 
>>>>>>> recommend?
>>>>>>>
>>>>>>> Fine tuning gives much better result but it tends to pick other 
>>>>>>> character than in E13B that only has 14 characters, 0 through 9 and 4 
>>>>>>> symbols.  I thought training from scratch would eliminate such 
>>>>>>> confusion.
>>>>>>>
>>>>>>> 2019年5月30日木曜日 10時43分08秒 UTC+9 shree:
>>>>>>>>
>>>>>>>> For training from scratch a large training text and hundreds of 
>>>>>>>> thousands of iterations are recommended. 
>>>>>>>>
>>>>>>>> If you are just fine tuning fo

[tesseract-ocr] Re: Training help

2019-06-10 Thread ElGato ElMago
Did you try the tutorial at all? It's pretty good guidance, though you 
might need help here and there.

2019年6月9日日曜日 15時27分23秒 UTC+9 Mox Betex:
>
> Can someone explain me how to create training data for tesseract 4.0?
> I read tutorial on web but I really don't understand.
> Is there some GUI software for training?
> Do I have to create training data  with single font or image of text lines?
>
>



[tesseract-ocr] Re: Tesseract does not give good output we need some suggestion.

2019-06-10 Thread ElGato ElMago
Do you know what font this is?  Maybe you can train it.

2019年6月10日月曜日 14時33分12秒 UTC+9 Bhamare Harshal:
>
> Hi,
>
> In the attached images, we applied fastNlMeansDenoisingColored, grayscaling, 
> gaussian blur, mean thresholding, erosion, then black to white (black font 
> on white background),
> but the output is still not 100% accurate; it will sometimes take 5 as S, 
> 8 as S, and so on.
>
> OUTPUT IS AS FOLLOWS
>
> 9754.JPG =>   BMRHRYW] 840KPO82563g
> 9795.JPG =>  BYRHRW1840kP08257 19
> 9795[1].JPG =>  BYRHRW1840kP08257 19
> 10034.JPG =>  MRARWE830KP 1 030¢0
> 10034[1].JPG =>  MRARWE830KP 1 030¢0
> 10527.JPG =>  RMRHRW1840KP062ol2
> 10541.JPG =>  RMRHRW3860KPOL00688
> 10567.JPG =>  MRHRWESS3OKP 108062
>
>
> please help
>
> Regards,
> Harshal
>



Re: [tesseract-ocr] Trained data for E13B font

2019-06-09 Thread ElGato ElMago
That'll be nice if there's traineddata out there but I didn't find any.  I 
see free fonts and commercial OCR software but not traineddata.  Tessdata 
repository obviously doesn't have one, either.

2019年6月8日土曜日 1時52分10秒 UTC+9 shree:
>
> Please also search for existing MICR traineddata files.
>
> On Thu, Jun 6, 2019 at 1:09 PM ElGato ElMago  > wrote:
>
>> So I did several tests from scratch.  In the last attempt, I made a 
>> training text with 4,000 lines in the following format,
>>
>> 110004310510<   <02 :4002=0181:801= 0008752 <00039 ;001000;
>>
>>
>> and combined it with eng.digits.training_text in which symbols are 
>> converted to E13B symbols.  This makes about 12,000 lines of training 
>> text.  It's amazing that this thing generates a good reader out of 
>> nowhere.  But then it is not very good.  For example:
>>
>> <01 :1901=1386:021= 001<10001< ;090134;
>>
>> is a result on the image attached.  It's close but the last '<' in the 
>> result text doesn't exist on the image.  It's a small failure but it causes 
>> a greater trouble in parsing.
>>
>> What would you suggest from here to increase accuracy?  
>>
>>- Increase the number of lines in the training text
>>- Mix up more variations in the training text
>>- Increase the number of iterations
>>- Investigate wrong reads one by one
>>- Or else?
>>
>> Also, I referred to engrestrict*.* and could generate similar result with 
>> the fine-tuning-from-full method.  It seems a bit faster to get to the same 
>> level but it also stops at a 'good' level.  I can go with either way if it 
>> takes me to the bright future.
>>
>> Regards,
>> ElMagoElGato
>>
>> 2019年5月30日木曜日 15時56分02秒 UTC+9 ElGato ElMago:
>>>
>>> Thanks a lot, Shree. I'll look it in.
>>>
>>> 2019年5月30日木曜日 14時39分52秒 UTC+9 shree:
>>>>
>>>> See https://github.com/Shreeshrii/tessdata_shreetest
>>>>
>>>> Look at the files engrestrict*.* and also 
>>>> https://github.com/Shreeshrii/tessdata_shreetest/blob/master/eng.digits.training_text
>>>>
>>>> Create training text of about 100 lines and finetune for 400 lines 
>>>>
>>>>
>>>>
>>>> On Thu, May 30, 2019 at 9:38 AM ElGato ElMago  
>>>> wrote:
>>>>
>>>>> I had about 14 lines as attached.  How many lines would you recommend?
>>>>>
>>>>> Fine tuning gives much better result but it tends to pick other 
>>>>> character than in E13B that only has 14 characters, 0 through 9 and 4 
>>>>> symbols.  I thought training from scratch would eliminate such confusion.
>>>>>
>>>>> 2019年5月30日木曜日 10時43分08秒 UTC+9 shree:
>>>>>>
>>>>>> For training from scratch a large training text and hundreds of 
>>>>>> thousands of iterations are recommended. 
>>>>>>
>>>>>> If you are just fine tuning for a font try to follow instructions for 
>>>>>> training for impact, with your font.
>>>>>>
>>>>>>
>>>>>> On Thu, 30 May 2019, 06:05 ElGato ElMago,  
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks, Shree.
>>>>>>>
>>>>>>> Yes, I saw the instruction.  The steps I made are as follows:
>>>>>>>
>>>>>>> Using tesstrain.sh:
>>>>>>> src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng 
>>>>>>> --linedata_only \
>>>>>>>   --noextract_font_properties --langdata_dir ../langdata \
>>>>>>>   --tessdata_dir ./tessdata \
>>>>>>>   --fontlist "E13Bnsd" --output_dir ~/tesstutorial/e13beval \
>>>>>>>   --training_text ../langdata/eng/eng.training_e13b_text
>>>>>>>
>>>>>>> Training from scratch:
>>>>>>> mkdir -p ~/tesstutorial/e13boutput
>>>>>>> src/training/lstmtraining --debug_interval 100 \
>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>   --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 
>>>>>>> O1c111]' \
>>>>>>>   --model_output ~/tesstutorial/e13boutput/base --learning_rate 
>>>>>>> 20e-4 \
>>>>>>>   --train_listfile ~/tesstutorial/e13beval/eng.

[tesseract-ocr] Re: Can 100s of repeats of a fixed paragraph of text with only tiny variances be used to enhance accuracy?

2019-06-06 Thread ElGato ElMago
Do you not use a dictionary?
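[Editor's illustration] For reference, the tesseract CLI does accept a word list via --user-words, so the fixed vocabulary below could be collected into a file and passed in. A sketch under that assumption (filenames hypothetical; the flag's effect with the LSTM engine has historically been limited):

```shell
# Build a hypothetical word list from the boilerplate vocabulary.
cat > pension.words <<'EOF'
Regiment
Volunteer
Cavalry
Infantry
Company
EOF
# Then point tesseract at it (not run here):
#   tesseract scan.png out --user-words pension.words
wc -l < pension.words
```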

2019年6月7日金曜日 3時50分51秒 UTC+9 Charles885:
>
> Hi,
>
> Is there a way of enhancing Tesseract's accuracy with quite low quality 
> scans by telling it that it will consistently see the same fixed paragraph 
> of text with only minor variations? Essentially, associating word order and 
> a fixed vocabulary to enhance recognition.
>
> I have hundreds of scanned paragraphs like this:
>
> "Be it enacted by the Senate and House of Representatives of the United 
> States of America in Congress assembled, That the Secretary of the Interior 
> be, and he is hereby, authorized and directed to place on the pension roll, 
> subject to the provisions and limitations of the pension laws, the name of 
> *John 
> Bullamore, late of Company G, Second Regiment Wisconsin Volunteer Cavalry*, 
> and pay him a pension at the rate of *thirty *dollars per month. 
> Approved, *January 30, 1904*."
>
> Only the parts I have highlighted will ever change between paragraphs, so 
> if I could only find a way of achieving it, Tesseract would know *exactly* 
> how 
> to OCR the boiler plate parts of the following to a very high level of 
> accuracy (actual Tesseract output below):
>
> "Be it enacted by the Senate and House of R{{resentat·a}ves of the United 
> States of America   Congress assembled, at the Secretrry of the Interior 
> be, and he is here y, authorized and directed tc; palace on _ e nsion roll, 
> subgect to the provisions and lim1tations`o ‘ e pension
> ‘ lfws, the name of  John Bul amore, late of Lompany (x,_ Second Regiment 
> Wisconsin Volunteer Cavalry, and pay him a pension at the rate of thirty 
> dollars per month in lieu of that he is now rece1ving. Approved, January 
> 30, 1904."
>
> In the first variable string, the words "widow, late, Company, Cavalry, 
> Infantry, Artillery, Regiment, Volunteer" plus various ranks and any number 
> of States' names also appear very frequently. 
> In the second, only a very small number of payment amounts appear to be 
> used (from memory they're 8, 10, 12, 16, 20, 24, 30, 40 and 50). 
> In the third variable string, there are only 12 words to choose from!
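[Editor's illustration] Since the payment amount can only be one of that small set, a post-correction pass can snap the OCR'd token onto the nearest legal value. A rough sketch (the 0/O and 1/I/l confusion mapping is an assumption, not a property of Tesseract):

```shell
# Snap an OCR'd amount token to the closed set of legal values.
legal="8 10 12 16 20 24 30 40 50"
token='3O'                                    # e.g. OCR read 'O' for '0'
fixed=$(printf '%s' "$token" | tr 'OIl' '011')  # undo common confusions
for v in $legal; do
  [ "$fixed" = "$v" ] && echo "amount: $v"
done
```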
>
> If my larger thesis is achievable, ideally each of these parts could also 
> be trained with an additional subset of vocabulary?
>
> Thanks and kind regards
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/570e6923-a312-4fc2-9244-41c0c742af59%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: [tesseract-ocr] Trained data for E13B font

2019-05-30 Thread ElGato ElMago
Thanks a lot, Shree. I'll look into it.

2019年5月30日木曜日 14時39分52秒 UTC+9 shree:
>
> See https://github.com/Shreeshrii/tessdata_shreetest
>
> Look at the files engrestrict*.* and also 
> https://github.com/Shreeshrii/tessdata_shreetest/blob/master/eng.digits.training_text
>
> Create training text of about 100 lines and finetune for 400 lines 
>
>
>
> On Thu, May 30, 2019 at 9:38 AM ElGato ElMago  > wrote:
>
>> I had about 14 lines as attached.  How many lines would you recommend?
>>
>> Fine tuning gives much better result but it tends to pick other character 
>> than in E13B that only has 14 characters, 0 through 9 and 4 symbols.  I 
>> thought training from scratch would eliminate such confusion.
>>
>> 2019年5月30日木曜日 10時43分08秒 UTC+9 shree:
>>>
>>> For training from scratch a large training text and hundreds of 
>>> thousands of iterations are recommended. 
>>>
>>> If you are just fine tuning for a font try to follow instructions for 
>>> training for impact, with your font.
>>>
>>>
>>> On Thu, 30 May 2019, 06:05 ElGato ElMago,  wrote:
>>>
>>>> Thanks, Shree.
>>>>
>>>> Yes, I saw the instruction.  The steps I made are as follows:
>>>>
>>>> Using tesstrain.sh:
>>>> src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng 
>>>> --linedata_only \
>>>>   --noextract_font_properties --langdata_dir ../langdata \
>>>>   --tessdata_dir ./tessdata \
>>>>   --fontlist "E13Bnsd" --output_dir ~/tesstutorial/e13beval \
>>>>   --training_text ../langdata/eng/eng.training_e13b_text
>>>>
>>>> Training from scratch:
>>>> mkdir -p ~/tesstutorial/e13boutput
>>>> src/training/lstmtraining --debug_interval 100 \
>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>   --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 
>>>> O1c111]' \
>>>>   --model_output ~/tesstutorial/e13boutput/base --learning_rate 20e-4 \
>>>>   --train_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>   --max_iterations 5000 &>~/tesstutorial/e13boutput/basetrain.log
>>>>
>>>> Test with base_checkpoint:
>>>> src/training/lstmeval --model ~/tesstutorial/e13boutput/base_checkpoint 
>>>> \
>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt
>>>>
>>>> Combining output files:
>>>> src/training/lstmtraining --stop_training \
>>>>   --continue_from ~/tesstutorial/e13boutput/base_checkpoint \
>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>   --model_output ~/tesstutorial/e13boutput/eng.traineddata
>>>>
>>>> Test with eng.traineddata:
>>>> tesseract e13b.png out --tessdata-dir 
>>>> /home/koichi/tesstutorial/e13boutput
>>>>
>>>>
>>>> The training from scratch ended as:
>>>>
>>>> At iteration 561/2500/2500, Mean rms=0.159%, delta=0%, char train=0%, 
>>>> word train=0%, skip ratio=0%,  New best char error = 0 wrote best 
>>>> model:/home/koichi/tesstutorial/e13boutput/base0_561.checkpoint wrote 
>>>> checkpoint.
>>>>
>>>>
>>>> The test with base_checkpoint returns nothing as:
>>>>
>>>> At iteration 0, stage 0, Eval Char error rate=0, Word error rate=0
>>>>
>>>>
>>>> The test with eng.traineddata and e13b.png returns out.txt.  Both files 
>>>> are attached.
>>>>
>>>> Training seems to have worked fine.  I don't know how to translate the 
>>>> test result from base_checkpoint.  The generated eng.traineddata obviously 
>>>> doesn't work well. I suspect the choice of --traineddata in combining 
>>>> output files is bad but I have no clue.
>>>>
>>>> Regards,
>>>> ElMagoElGato
>>>>
>>>> BTW, I referred to your tess4training in the process.  It helped a lot.
>>>>
>>>> 2019年5月29日水曜日 19時14分08秒 UTC+9 shree:
>>>>>
>>>>> see 
>>>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#combining-the-output-files
>>>>>
>>>>> On Wed, May 29, 2019 at 3:18 PM ElGato ElMago  
>>>>>

Re: [tesseract-ocr] Trained data for E13B font

2019-05-29 Thread ElGato ElMago
I had about 14 lines, as attached.  How many lines would you recommend?

Fine-tuning gives a much better result, but it tends to pick characters 
other than those in E13B, which has only 14 characters: 0 through 9 and 4 
symbols.  I thought training from scratch would eliminate such confusion.
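[Editor's illustration] One way to rule that confusion out before training is to verify that the training text itself contains nothing outside the 14-character set. A sketch, again using the : ; < = transliteration for the four symbols (the two sample lines written here are made up; in practice you would check the real eng.training_e13b_text):

```shell
# Write two sample training lines, then scan for any character outside
# the E13B set (digits, the four symbols, and whitespace).
printf '%s\n' '<01 :1901=1386:021=' '0123456789 ;:<=' > eng.training_e13b_text
grep -nE '[^0-9:;<= ]' eng.training_e13b_text || echo "training text clean"
```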

2019年5月30日木曜日 10時43分08秒 UTC+9 shree:
>
> For training from scratch a large training text and hundreds of thousands 
> of iterations are recommended. 
>
> If you are just fine tuning for a font try to follow instructions for 
> training for impact, with your font.
>
>
> On Thu, 30 May 2019, 06:05 ElGato ElMago,  > wrote:
>
>> Thanks, Shree.
>>
>> Yes, I saw the instruction.  The steps I made are as follows:
>>
>> Using tesstrain.sh:
>> src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng 
>> --linedata_only \
>>   --noextract_font_properties --langdata_dir ../langdata \
>>   --tessdata_dir ./tessdata \
>>   --fontlist "E13Bnsd" --output_dir ~/tesstutorial/e13beval \
>>   --training_text ../langdata/eng/eng.training_e13b_text
>>
>> Training from scratch:
>> mkdir -p ~/tesstutorial/e13boutput
>> src/training/lstmtraining --debug_interval 100 \
>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>   --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' 
>> \
>>   --model_output ~/tesstutorial/e13boutput/base --learning_rate 20e-4 \
>>   --train_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>   --max_iterations 5000 &>~/tesstutorial/e13boutput/basetrain.log
>>
>> Test with base_checkpoint:
>> src/training/lstmeval --model ~/tesstutorial/e13boutput/base_checkpoint \
>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt
>>
>> Combining output files:
>> src/training/lstmtraining --stop_training \
>>   --continue_from ~/tesstutorial/e13boutput/base_checkpoint \
>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>   --model_output ~/tesstutorial/e13boutput/eng.traineddata
>>
>> Test with eng.traineddata:
>> tesseract e13b.png out --tessdata-dir /home/koichi/tesstutorial/e13boutput
>>
>>
>> The training from scratch ended as:
>>
>> At iteration 561/2500/2500, Mean rms=0.159%, delta=0%, char train=0%, 
>> word train=0%, skip ratio=0%,  New best char error = 0 wrote best 
>> model:/home/koichi/tesstutorial/e13boutput/base0_561.checkpoint wrote 
>> checkpoint.
>>
>>
>> The test with base_checkpoint returns nothing as:
>>
>> At iteration 0, stage 0, Eval Char error rate=0, Word error rate=0
>>
>>
>> The test with eng.traineddata and e13b.png returns out.txt.  Both files 
>> are attached.
>>
>> Training seems to have worked fine.  I don't know how to translate the 
>> test result from base_checkpoint.  The generated eng.traineddata obviously 
>> doesn't work well. I suspect the choice of --traineddata in combining 
>> output files is bad but I have no clue.
>>
>> Regards,
>> ElMagoElGato
>>
>> BTW, I referred to your tess4training in the process.  It helped a lot.
>>
>> 2019年5月29日水曜日 19時14分08秒 UTC+9 shree:
>>>
>>> see 
>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#combining-the-output-files
>>>
>>> On Wed, May 29, 2019 at 3:18 PM ElGato ElMago  
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I wish to make a trained data for E13B font.
>>>>
>>>> I read the training tutorial and made a base_checkpoint file according 
>>>> to the method in Training From Scratch.  Now, how can I make a trained 
>>>> data 
>>>> from the base_checkpoint file?
>>>>

Re: [tesseract-ocr] Trained data for E13B font

2019-05-29 Thread ElGato ElMago
Thanks, Shree.

Yes, I saw the instructions.  The steps I took are as follows:

Using tesstrain.sh:
src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng 
--linedata_only \
  --noextract_font_properties --langdata_dir ../langdata \
  --tessdata_dir ./tessdata \
  --fontlist "E13Bnsd" --output_dir ~/tesstutorial/e13beval \
  --training_text ../langdata/eng/eng.training_e13b_text

Training from scratch:
mkdir -p ~/tesstutorial/e13boutput
src/training/lstmtraining --debug_interval 100 \
  --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
  --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
  --model_output ~/tesstutorial/e13boutput/base --learning_rate 20e-4 \
  --train_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
  --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
  --max_iterations 5000 &>~/tesstutorial/e13boutput/basetrain.log

Test with base_checkpoint:
src/training/lstmeval --model ~/tesstutorial/e13boutput/base_checkpoint \
  --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
  --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt

Combining output files:
src/training/lstmtraining --stop_training \
  --continue_from ~/tesstutorial/e13boutput/base_checkpoint \
  --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
  --model_output ~/tesstutorial/e13boutput/eng.traineddata

Test with eng.traineddata:
tesseract e13b.png out --tessdata-dir /home/koichi/tesstutorial/e13boutput


The training from scratch ended as:

At iteration 561/2500/2500, Mean rms=0.159%, delta=0%, char train=0%, word 
train=0%, skip ratio=0%,  New best char error = 0 wrote best 
model:/home/koichi/tesstutorial/e13boutput/base0_561.checkpoint wrote 
checkpoint.


The test with base_checkpoint returns nothing as:

At iteration 0, stage 0, Eval Char error rate=0, Word error rate=0


The test with eng.traineddata and e13b.png returns out.txt.  Both files are 
attached.

Training seems to have worked fine.  I don't know how to interpret the test 
result from base_checkpoint.  The generated eng.traineddata obviously 
doesn't work well.  I suspect my choice of --traineddata when combining the 
output files is wrong, but I have no clue.

Regards,
ElMagoElGato

BTW, I referred to your tess4training in the process.  It helped a lot.

2019年5月29日水曜日 19時14分08秒 UTC+9 shree:
>
> see 
> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#combining-the-output-files
>
> On Wed, May 29, 2019 at 3:18 PM ElGato ElMago  > wrote:
>
>> Hi,
>>
>> I wish to make a trained data for E13B font.
>>
>> I read the training tutorial and made a base_checkpoint file according to 
>> the method in Training From Scratch.  Now, how can I make a trained data 
>> from the base_checkpoint file?
>>
>>
>
>
> -- 
>
> 
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>


[tesseract-ocr] Trained data for E13B font

2019-05-29 Thread ElGato ElMago
Hi,

I wish to make trained data for the E13B font.

I read the training tutorial and made a base_checkpoint file following the 
Training From Scratch method.  Now, how can I make a traineddata file from 
the base_checkpoint file?
