Re: [tesseract-ocr] Re: How can I do the training using my own image in Tesseract 4.0

2018-01-10 Thread Anubhav Rohatgi
Hi Shree,


The box file you attached seems to contradict the Tesseract 4.0 LSTM
training tutorial, which states that boxes should be at line level rather
than at character level. Please correct me if I am wrong. I still do not
understand how to train Tesseract on real image data that I have collected
from scanned documents. It would help all of us here to have a sample video
showing how to train Tesseract, at least the first steps with the proper
commands.

Thanks in advance.
Anubhav


On Tuesday, 7 February 2017 21:04:11 UTC+5:30, shree wrote:
>
> For LSTM training, box files need to have an additional line for each
> text line with the tab character to indicate a new line.
>
> If you have existing box/tiff pairs, you can use a box editor (such as
> jTessBoxEditor) and insert a box at the end of each line and add a tab
> character in it.
>
> > On the toolbar, the Character textbox has a built-in conversion
> > function. If you enter U+0009 and hit Enter key or click on the adjacent
> > Tool icon, the escape sequences will be converted to Unicode. You can also
> > enter the tab character via Alt+09 numpad keys on Windows.
>
> Or add a dummy sequence such as @@@ and then replace it with a tab
> character in a text editor.
>
> See attached files as a sample.
>
> Then modify tesstrain.sh to copy the box/tiff pairs to the training
> directory before starting training:
>
> mkdir -p ${TRAINING_DIR}
> tlog "\n=== Starting training for language '${LANG_CODE}'"
>
> cp ./*.box "${TRAINING_DIR}/"
> cp ./*.tif "${TRAINING_DIR}/"
>
>
> On Tue, Feb 7, 2017 at 8:27 PM, Kay-Michael Würzner wrote:
>
>> +1 for this question. The training documentation for Tesseract 4.0 by now 
>> only covers training with font files (synthetic materials). What is missing 
>> is information on training with real data (i.e. manually aligned ground 
>> truth).
>> Any hints on that matter are greatly appreciated.
>>
>> Cheers,
>> Kay
>>
>> On Wednesday, January 18, 2017 at 12:31:54 AM UTC+1, chen...@huawei.com 
>> wrote:
>>>
>>> I have a bunch of images containing English words.
>>> I would like to generate training data from these images and do the
>>> training.
>>> How should I do this?
>>>
>>> Thanks a lot.
>>>
>
>
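
The @@@ placeholder route described in the quoted reply can be scripted instead of done by hand in a text editor. A minimal sketch, assuming the usual six-field box line (symbol, left, bottom, right, top, page) and that the placeholder sits in the symbol field. The function name and the sample coordinates are made up for illustration, and whether the resulting file is accepted by the LSTM trainer is as reported in this thread, not something this sketch checks:

```python
def placeholder_to_tab(box_text, placeholder="@@@"):
    """Replace a placeholder symbol with a tab character in box-file text.

    Assumes each line looks like: <symbol> <left> <bottom> <right> <top> <page>
    Lines whose symbol field is not the placeholder are passed through as-is.
    """
    fixed = []
    for line in box_text.splitlines():
        symbol, sep, rest = line.partition(" ")
        if symbol == placeholder:
            line = "\t" + sep + rest
        fixed.append(line)
    return "".join(entry + "\n" for entry in fixed)


# Hypothetical three-box sample: the middle box marks the end of a text line.
sample = "a 10 10 20 30 0\n@@@ 21 10 22 30 0\nb 10 40 20 60 0\n"
converted = placeholder_to_tab(sample)
```

Only lines whose first field is exactly the placeholder are touched, so a literal "@" character elsewhere in the ground truth is left alone.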

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/bc9e908a-add3-41c6-b418-6b30c314905d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


[tesseract-ocr] Re: Invalid Digit recognition

2018-01-10 Thread mark
Hi 

Just stumbled on this forum while looking for answers as to why the
Tesseract demo on the site would fail with my images (I am using a very
similar approach: single digits in images, etc.).

Found that scaling the image height by 50% worked a charm, thanks!! Never
thought to do that!! I also cropped the image as close as possible to the
digit to remove all the background clutter, and that removed the [ ] I was
getting on each side of the correct digit.

Just wanted to say thanks for your tips
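
The tight crop described above can be sketched without any imaging library. This is only an illustration: it assumes the image is already a grayscale grid (a list of rows of 0-255 values) with dark ink on a light background, and the function name, the threshold of 128 and the sample grid are all invented here:

```python
def crop_to_content(pixels, threshold=128, margin=0):
    """Return the smallest sub-grid that contains every "ink" pixel.

    pixels is a list of rows of grayscale values; a pixel counts as ink
    when its value is below threshold. margin adds that many extra
    rows/columns around the ink, clamped to the image bounds.
    """
    if not pixels or not pixels[0]:
        return pixels
    width = len(pixels[0])
    rows = [y for y, row in enumerate(pixels) if any(p < threshold for p in row)]
    cols = [x for x in range(width) if any(row[x] < threshold for row in pixels)]
    if not rows:
        return pixels  # blank image: nothing to crop to
    top = max(rows[0] - margin, 0)
    bottom = min(rows[-1] + margin, len(pixels) - 1)
    left = max(cols[0] - margin, 0)
    right = min(cols[-1] + margin, width - 1)
    return [row[left:right + 1] for row in pixels[top:bottom + 1]]


image = [
    [255, 255, 255, 255, 255],
    [255,   0,   0, 255, 255],
    [255, 255,   0, 255, 255],
    [255, 255, 255, 255, 255],
]
cropped = crop_to_content(image)
```

A real scan usually has speckle noise, so a fixed threshold like this would likely need tuning or a denoising step first.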



Re: [tesseract-ocr] Re: Need Help with extracting info from Invoice

2018-01-10 Thread Afreen Ferdoash
it is still not making any difference

On Wednesday, January 10, 2018 at 9:27:20 PM UTC+5:30, shree wrote:
>
>
> On Wed, Jan 10, 2018 at 8:07 PM, Afreen Ferdoash wrote:
>
>> I am trying to solve a similar problem, that of reading forms. Tesseract
>> 4 is doing well but is DROPPING lots of words within boxes. I thought
>> this problem of dropping words existed with Indic languages, but here I am
>> having this issue for English too!
>> I tried to fool around with some parameters, but the handful I tried
>> didn't lead to *any* change in the output.
>>
>> @Shree: Can you please suggest something, since you too faced this issue
>> earlier with another language?
>>
>>
> Please see
> https://github.com/tesseract-ocr/tesseract/issues/681#issuecomment-356358284
>
> @amido has offered a patch.
>
>



[tesseract-ocr] Working on degraded test data

2018-01-10 Thread gale
Hi guys, I am working on some degraded text images (Japanese). Is there
any way to adjust for degraded images in the training set? And should I do
this?
Regards



Re: [tesseract-ocr] Re: Need Help with extracting info from Invoice

2018-01-10 Thread ShreeDevi Kumar
On Wed, Jan 10, 2018 at 8:07 PM, Afreen Ferdoash 
wrote:

> I am trying to solve a similar problem, that of reading forms. Tesseract
> 4 is doing well but is DROPPING lots of words within boxes. I thought
> this problem of dropping words existed with Indic languages, but here I am
> having this issue for English too!
> I tried to fool around with some parameters, but the handful I tried
> didn't lead to *any* change in the output.
>
> @Shree: Can you please suggest something, since you too faced this issue
> earlier with another language?
>
>
Please see
https://github.com/tesseract-ocr/tesseract/issues/681#issuecomment-356358284

@amido has offered a patch.



Re: [tesseract-ocr] Re: Need Help with extracting info from Invoice

2018-01-10 Thread Afreen Ferdoash
I am trying to solve a similar problem, that of reading forms. Tesseract 4
is doing well but is DROPPING lots of words within boxes. I thought this
problem of dropping words existed with Indic languages, but here I am having
this issue for English too!
I tried to fool around with some parameters, but the handful I tried
didn't lead to *any* change in the output.

@Shree: Can you please suggest something, since you too faced this issue
earlier with another language?

On Wednesday, January 10, 2018 at 3:17:01 PM UTC+5:30, shree wrote:
>
> See https://github.com/tesseract-ocr/tesseract/wiki/APIExample
>
> For example of using tesseract in a program.
>
> The training tutorial you refer to is old. 
> See tesstrain.sh for creating synthetic training data.
>
> On 10-Jan-2018 2:54 PM, "saumitra mallick" wrote:
>
>> Hello all ,
>>>
>> I'm working on similar project , in my case i'm reading bank statements. 
>> I noticed the following 
>> 1. when you have a single line of text tesseract performs much better
>> 2. I'm using openCV to cut individual cells from a table (you always know 
>> the order of cells since you cut them )
>> 3. once you have data in individual cells (image files ), single line 
>> data gives much accurate results than multiline data ( anyone tried LSTM , 
>> instead of reading full text , maybe cut down individual cells to 
>> individual line and use line recognition with tesseract  ?? Please let me 
>> know the results ) 
>>
>> I need help for :
>> - how do I use tesseract in my C++ code , for the time being I'm using 
>> tesseract from command line 
>> - Please post a sample program for me ,which does the following
>>    -  make tesseract read an image
>>    -  generate text output from it and write it to a file
>>
>> If you guys are facing bumps in generating traineddata this post might 
>> help  
>>
>> http://pretius.com/how-to-prepare-training-files-for-tesseract-ocr-and-improve-characters-recognition/
>>
>> Please let me know if anyone is interested in sharing knowledge with me 
>> about the same .
>>
>> Contact me at saumitr...@gmail.com  
>>
>> Best Regards
>> Saumitra Mallick
>>
>



Re: [tesseract-ocr] VietOCR 5.0 alpha availability

2018-01-10 Thread Quan Nguyen
Just updated again to use Tesseract 4.00 fast data.

On Monday, January 8, 2018 at 5:16:50 PM UTC-6, Quan Nguyen wrote:
>
> Just updated the alpha versions with latest Tesseract 4.00alpha 
> executables.
>
> https://sourceforge.net/projects/vietocr/files/
>
> On Monday, April 3, 2017 at 6:26:37 AM UTC-5, shree wrote:
>>
>> You need to get vietocr 5.0 alpha for tesseract 4.0 alpha
>>
>> https://sourceforge.net/projects/vietocr/files/vietocr.net/5.0alpha/
>>
>> https://sourceforge.net/projects/vietocr/files/vietocr/5.0alpha/
>>
>> ShreeDevi
>> 
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Mon, Apr 3, 2017 at 2:52 PM, El Fakir Zakaria wrote:
>>
>>> Is this using Tesseract 3.04, not 4.00alpha?
>>>
>>> 2017-03-31 18:13 GMT+01:00 Quan Nguyen :
>>>
>>>> VietOCR 5.0 alpha, Java & .NET GUI frontend for Tesseract 4.00alpha, is
>>>> available for download. Any feedback is welcome. Thanks.
>>>>
>>>> https://sourceforge.net/projects/vietocr/files/

>>
>>



[tesseract-ocr] Variables having no effect on C# Tesseract.net 4.0.0.6 wrapper

2018-01-10 Thread James Q
Here is my code:

string text = "";
string tessDataPath = ConfigurationManager.AppSettings["TessPath"];

using (var engine = new TessBaseAPI(@tessDataPath, @"eng"))
{
    engine.SetVariable("tessedit_ocr_engine_mode", "0");
    engine.SetPageSegMode(PageSegmentationMode.SINGLE_LINE);
    engine.SetVariable("tessedit_char_blacklist", type.GetTesseractOptions().Blacklist());
    engine.SetVariable("tessedit_char_whitelist", type.GetTesseractOptions().Whitelist());
    engine.Process(imageFileName, false);
    text = engine.GetUTF8Text();
}

I'm sending images that contain one or a few words on a single line, but
in the above code the SetPageSegMode(..) method has no effect. On the
command line I can use:

tesseract.exe input.png result -l eng --psm 7 --oem 1

on the same images and see clearly better results with psm 7. Does anyone
know how to configure this option via the wrapper, or is it just not
supported?

Also, blacklists and whitelists are having no effect in the wrapper. Whilst
I understand that these are not yet supported in Tesseract 4 LSTM mode,
they should still work in 'Tesseract Only' mode, right? I know the
SetVariable method works (as I see its effect on engine mode). Is there
another way of setting blacklists and whitelists through this wrapper?

Thanks
James 
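
This does not answer the wrapper question, but as a fallback the command line quoted above can be driven from a program. A minimal sketch that only builds the argument list; the helper name is hypothetical, and it relies on the CLI's `-c VAR=VALUE` option for setting variables such as the whitelist. Whether a whitelist actually takes effect still depends on the engine mode, as discussed in the post:

```python
def tesseract_argv(image, out_base, lang="eng", psm=7, oem=1, config=None):
    """Build an argument list for the tesseract command-line tool.

    config is an optional dict of Tesseract variables; each entry is
    passed as "-c name=value". Nothing is executed here.
    """
    argv = ["tesseract", image, out_base,
            "-l", lang, "--psm", str(psm), "--oem", str(oem)]
    for name, value in (config or {}).items():
        argv += ["-c", f"{name}={value}"]
    return argv


# Same call as in the post, but with the legacy engine and a digit whitelist.
argv = tesseract_argv(
    "input.png", "result", oem=0,
    config={"tessedit_char_whitelist": "0123456789"},
)
```

The list can be handed to `subprocess.run(argv, check=True)`; passing a list rather than a single string avoids shell quoting problems with whitelist characters.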



Re: [tesseract-ocr] Traineddata always ended in same size and did not match with wordlist

2018-01-10 Thread ShreeDevi Kumar
On Wed, Jan 10, 2018 at 3:56 PM,  wrote:

> It works !!
> I modified your bash script and executed it. Finally I get different
> traineddata size.
>
> But, can I train it from scratch?
> It needs starting traineddata which I can get from combine_lang_model,
> isn't it?
>
>
Starter traineddata will be generated by tesstrain.sh; change the files in
the langdata folder to customize it.

To train from scratch, you need to change the lstmtraining command. It
will not need --continue_from and --old_traineddata.

You will need to add a network specification, such as

 --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \

Usually the best traineddata will have the network spec used for training
by Ray as part of the version string.

See https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
for more details.
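
The difference between the two modes can be written down as a small argument builder. This is a sketch under assumptions, not the command from the attached script: the helper name and every path below are placeholders, and only the flags named in this thread plus --train_listfile are used:

```python
NET_SPEC = "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]"


def lstmtraining_argv(traineddata, model_output, train_listfile,
                      net_spec=None, continue_from=None, old_traineddata=None):
    """Build an lstmtraining argument list.

    From scratch: pass net_spec and leave continue_from/old_traineddata unset.
    Fine-tuning: pass continue_from (and old_traineddata if needed) instead.
    """
    if net_spec and (continue_from or old_traineddata):
        raise ValueError("training from scratch takes net_spec only")
    if not net_spec and not continue_from:
        raise ValueError("need net_spec (scratch) or continue_from (fine-tune)")
    argv = ["lstmtraining",
            "--traineddata", traineddata,
            "--model_output", model_output,
            "--train_listfile", train_listfile]
    if net_spec:
        argv += ["--net_spec", net_spec]
    if continue_from:
        argv += ["--continue_from", continue_from]
    if old_traineddata:
        argv += ["--old_traineddata", old_traineddata]
    return argv


# Placeholder paths, for illustration only.
scratch = lstmtraining_argv(
    "starter/mylang.traineddata", "output/base",
    "train/mylang.training_files.txt", net_spec=NET_SPEC,
)
```

Because the list is passed to the process directly (for example with `subprocess.run`), the brackets and spaces in the network spec need no shell quoting.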



Re: [tesseract-ocr] Traineddata always ended in same size and did not match with wordlist

2018-01-10 Thread easymavinmind
It works !!
I modified your bash script and executed it. Finally I get different 
traineddata size.

But, can I train it from scratch?
It needs starting traineddata which I can get from combine_lang_model, 
isn't it?
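
For anyone else who sees the same "always the same size" symptom: the diagnosis earlier in this thread was that an old file was probably being copied in place of the newly trained model. One way to check that is to compare the output with the starter file byte for byte. A minimal sketch; the function names are hypothetical and the paths are whatever your own script uses:

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged_copy(starter_path, output_path):
    """True when both files have identical contents, not just equal sizes."""
    return file_digest(starter_path) == file_digest(output_path)
```

Equal sizes alone do not prove the files are the same, and identical digests after training would point to the copy step rather than to the training itself.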

 
On Tuesday, January 9, 2018 at 7:36:08 PM UTC+7, shree wrote:
>
>
>> My reason for using combine_lang_model is to make my punc, wordlist, and
>> numbers affect the trained data. Or doesn't it work like that?
>>
>
> If you update the files in langdata folder and then run tesstrain.sh, it
> will automatically use your files.
>
>>
>> Now I will try your shell script for training, and will share the result
>> when it's done.
>>
>
> You will need to modify it according to the location of your files.
>
> Also, update the fonts list as per your requirements.
>
>
>>
>>
>> On Tuesday, January 9, 2018 at 6:17:40 PM UTC+7, shree wrote:
>>>
>>> 1. If you use tesstrain.sh, it will create the starter traineddata, you
>>> do NOT need to run combine_lang_model. If you want to change version string,
>>> look at tesstrain_utils.sh and modify the command in it.
>>>
>>> 2. If you are always getting the same size file, it looks like that you 
>>> are probably copying some old file as traineddata as part of your script. 
>>> It could be copying from a wrong folder or some such thing.
>>>
>>> I am attaching a bash script, you can modify it for your setup and try 
>>> if that helps.
>>>
>>> ShreeDevi
>>> 
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>
>>> On Tue, Jan 9, 2018 at 9:39 AM,  wrote:
>>>
>>>> Yes, I ran the following command in the tesseract/training directory:
>>>>
>>>> lstmtraining --stop_training \
>>>>   --continue_from ../result/mylangoutput/base_checkpoint \
>>>>   --traineddata ../result/mylangcombine/mylang/mylang.traineddata \
>>>>   --model_output ../result/mylangoutput/mylang.traineddata
>>>>
>>>> On Monday, January 8, 2018 at 7:36:50 PM UTC+7, shree wrote:
>>>>>
>>>>> Did you use --stop_training flag at the end?
>>>>>
>>>>> ShreeDevi
>>>>>
>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>>
>>>>> On Mon, Jan 8, 2018 at 5:51 PM,  wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I am doing my project using Tesseract v4.00, and I always get
>>>>>> traineddata output of the same size after training with my own data.
>>>>>> I suppose that I did not do the steps correctly.
>>>>>>
>>>>>> The only data that I provided were:
>>>>>> 1. training_text
>>>>>> 2. puncs (I just reduced the general punc as provided in tesseract
>>>>>> github)
>>>>>> 3. numbers
>>>>>> 4. wordlists (I made various wordlists for several trainings, ranging
>>>>>> between 100.000 - 2.000.000)
>>>>>> 5. font name (I also used various fonts for several trainings, ranging
>>>>>> between 1 - 20 fonts)
>>>>>>
>>>>>> The steps that I did were:
>>>>>> 1. Made tiff file, unicharset and other complementary data using
>>>>>> tesstrain.sh
>>>>>> 2. Made tiff file, unicharset and other complementary data using
>>>>>> tesstrain.sh for evaluation
>>>>>> 3. Combined unicharset, wordlists, puncs, numbers and version_str to
>>>>>> create starter traineddata using combine_lang_model (I am still not
>>>>>> confident about the value of version_str though)
>>>>>> 4. Trained data using lstmtraining
>>>>>> 5. Combined all output files using lstmtraining --continue_from ...
>>>>>>
>>>>>> Yet all of my trainings ended with the same size, which is 10.5MB.
>>>>>> Did I do all my steps correctly?
>>>>>>
>>>>>> Once, I also trained after modifying WORD_DAWG_FACTOR in
>>>>>> language-specific.sh to 0 and 1, because I want the recognized text to
>>>>>> match my wordlists 100%. But the result did not satisfy me either: some
>>>>>> output words are not in my wordlists, such as "USISUSISU".
>>>>>> Do you know what the cause is?
>>>>>>
>>>>>> I would really appreciate it if anyone can help or suggest a solution.
>>>>>> Thank you!!
>>>>>>

Re: [tesseract-ocr] Re: Need Help with extracting info from Invoice

2018-01-10 Thread ShreeDevi Kumar
See https://github.com/tesseract-ocr/tesseract/wiki/APIExample

For example of using tesseract in a program.

The training tutorial you refer to is old.
See tesstrain.sh for creating synthetic training data.

On 10-Jan-2018 2:54 PM, "saumitra mallick" 
wrote:

> Hello all ,
>>
> I'm working on similar project , in my case i'm reading bank statements. I
> noticed the following
> 1. when you have a single line of text tesseract performs much better
> 2. I'm using openCV to cut individual cells from a table (you always know
> the order of cells since you cut them )
> 3. once you have data in individual cells (image files ), single line data
> gives much accurate results than multiline data ( anyone tried LSTM ,
> instead of reading full text , maybe cut down individual cells to
> individual line and use line recognition with tesseract  ?? Please let me
> know the results )
>
> I need help for :
> - how do I use tesseract in my C++ code , for the time being I'm using
> tesseract from command line
> - Please post a sample program for me ,which does the following
>    -  make tesseract read an image
>    -  generate text output from it and write it to a file
>
> If you guys are facing bumps in generating traineddata this post might
> help
> http://pretius.com/how-to-prepare-training-files-for-tesseract-ocr-and-improve-characters-recognition/
>
> Please let me know if anyone is interested in sharing knowledge with me
> about the same .
>
> Contact me at saumitramall...@gmail.com
>
> Best Regards
> Saumitra Mallick
>


[tesseract-ocr] Re: Need Help with extracting info from Invoice

2018-01-10 Thread saumitra mallick

Hello all,

I'm working on a similar project; in my case I'm reading bank statements. I
noticed the following:
1. When you have a single line of text, Tesseract performs much better.
2. I'm using OpenCV to cut individual cells from a table (you always know
the order of the cells since you cut them).
3. Once you have the data in individual cells (image files), single-line
data gives much more accurate results than multi-line data. (Has anyone
tried LSTM this way: instead of reading the full text, cutting individual
cells down to individual lines and using line recognition with Tesseract?
Please let me know the results.)

I need help with:
- How do I use Tesseract in my C++ code? For the time being I'm using
tesseract from the command line.
- Please post a sample program for me which does the following:
   - makes Tesseract read an image
   - generates text output from it and writes it to a file

If you are facing bumps in generating traineddata, this post might help:
http://pretius.com/how-to-prepare-training-files-for-tesseract-ocr-and-improve-characters-recognition/

Please let me know if anyone is interested in sharing knowledge with me
about the same.

Contact me at saumitramall...@gmail.com

Best Regards
Saumitra Mallick
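
Until the C++ API route is in place, the two requested steps (read an image, write the text to a file) can be done by calling the command-line tool from a program. A minimal sketch in Python rather than C++, assuming `tesseract` is on the PATH; the function name and the default `psm=7` (single line, matching the single-line cells described above) are choices made here, not anything from the thread:

```python
import subprocess


def ocr_to_file(image_path, text_path, lang="eng", psm=7, run=subprocess.run):
    """OCR one image with the tesseract CLI and save the text to a file.

    Uses "stdout" as the output base so tesseract prints the text instead
    of writing its own file. run is injectable so the function can be
    exercised without tesseract installed.
    """
    argv = ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]
    result = run(argv, capture_output=True, encoding="utf-8", check=True)
    with open(text_path, "w", encoding="utf-8") as handle:
        handle.write(result.stdout)
    return result.stdout


# Example call (needs tesseract and a real image, so it is left commented out):
# ocr_to_file("cell_0_0.png", "cell_0_0.txt")
```

Calling it once per cut-out cell, in the order the cells were cut, keeps the table order without any extra bookkeeping.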
