[tesseract-ocr] Re: how to use PDF as Input

2018-01-04 Thread Subhanshu Gupta
Thanks Quan. :)

On Thursday, January 4, 2018 at 10:15:09 PM UTC+5:30, Quan Nguyen wrote:
>
> You can specify a .uzn file defining the zones.
>
> https://groups.google.com/forum/#!topic/tesseract-ocr/M0o5az7Zoo8
>
> On Thursday, January 4, 2018 at 7:37:48 AM UTC-6, Subhanshu Gupta wrote:
>>
>> Thanks Quan. One more thing, how can I use Tesseract to read a form 
>> having different data fields like Name, Address, etc. and save the 
>> corresponding data to somewhere else?
>>
>>
>> On Thursday, January 4, 2018 at 6:51:48 AM UTC+5:30, Quan Nguyen wrote:
>>>
>>> Tesseract engine cannot read PDF. You'll have to convert them to 
>>> suitable images (TIFF or PNG) first. There are many tools for that: 
>>> ImageMagick, GhostScript, PDFBox, etc.
>>>
>>> On Wednesday, January 3, 2018 at 12:05:12 PM UTC-6, Subhanshu Gupta 
>>> wrote:

>>>> Dear All,
>>>>
>>>> I am new to Tesseract OCR and need to implement it to Read PDF Forms 
>>>> but I am not able to find any good documentation for which method to use 
>>>> to read PDF as well as for Character Segmentation.
>>>> If any of you have any doc/manual relating on which method is used 
>>>> where it will be really very helpful.
>>>>
>>>> Thanks. :)

>>>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/72a1039e-4b56-44e8-8cd9-3817d918b726%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


[tesseract-ocr] Re: Use Tesseract as a library at Windows with Qt

2018-01-04 Thread John Grossman
Ugh, following up with some corrections to what I hacked together, in case 
they help anyone...

Previously, I had said...

"""
My project is a DLL and (for some reason) had Linker -> Link Library 
Dependencies set to "yes".  I needed to turn this off to avoid a situation 
where two different copies of MSVCRT were getting linked (you probably will 
not have to do this).
"""

This is not correct.  This option just tells the build system to link the 
import library for dependencies (in my case, set by adding the libtesseract 
project as a dependency of the library I am building).  Turning this off 
didn't fix my problem, it just hid it.  As soon as I tried to actually use 
the library, I failed to link because none of the tesseract symbols were 
available.

Turning this back on then caused the multiple symbol collision problem I 
was having earlier.  I tracked this down to the following cause...

cppan generated a cmake file which used the cmake __create_def feature to 
automatically create a DLL export file for the project.  The vcxproj file 
was set to use this generated exports .def file.  The trouble is that the 
automatically generated file was way too inclusive; it was exporting pretty 
much all of the symbols from std:: in addition to everything else.  So, 
when it came time to link, all of the standard library symbols I was using 
in my library collided with the symbols from the generated libtesseract 
import library.

The thing is, generating an explicit export definition file is not needed 
at all.  The tesseract library is already well set up to flag what should 
and should not be exported using the TESS_API macro (defined in 
ccutil\platform.h).  So, all I needed to do is remove the explicit export 
definition file and everything worked out just fine (Configuration 
Properties -> Linker -> All Options -> Module Definition File should be 
empty).
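
For anyone who wants to check this across several configurations at once, here is a minimal sketch. It assumes the "Module Definition File" setting is stored in the .vcxproj as a <ModuleDefinitionFile> element under <Link>; the sample project text is hypothetical and heavily trimmed.

```python
import xml.etree.ElementTree as ET

# MSBuild project files (.vcxproj) use this XML namespace.
MSBUILD_NS = "{http://schemas.microsoft.com/developer/msbuild/2003}"


def find_module_definition_files(vcxproj_xml: str) -> list[str]:
    """Return every non-empty <ModuleDefinitionFile> value found in the project XML."""
    root = ET.fromstring(vcxproj_xml)
    found = []
    for element in root.iter(MSBUILD_NS + "ModuleDefinitionFile"):
        value = (element.text or "").strip()
        if value:
            found.append(value)
    return found


# Hypothetical, trimmed-down project file used only for illustration.
SAMPLE = """<Project xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
  <ItemDefinitionGroup>
    <Link>
      <ModuleDefinitionFile>exports.def</ModuleDefinitionFile>
    </Link>
  </ItemDefinitionGroup>
</Project>
"""

if __name__ == "__main__":
    # Any entry printed here is a configuration that still uses an explicit .def file.
    print(find_module_definition_files(SAMPLE))
```

An empty list means no configuration references an explicit export definition file, which is the state described above.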

-john



[tesseract-ocr] How to configure Tesseract4alfa for OEM 2

2018-01-04 Thread Debapriya Rahut
I used VietOCR 5 Alpha, which uses psm 3 and oem 2, and it produces output 
for the Bengali language.

But when I run the same modes directly from the console, it says it 
couldn't open oem 2 and produces output in legacy mode. The output is still 
better than any previous version, with very few errors. I like this version.

How can I use oem 2 with Tesseract 4 alpha?

Regards
Debapriya Rahut.



Re: [tesseract-ocr] Re: How to use tesseract4.0 to only recognize the digits??

2018-01-04 Thread ShreeDevi Kumar
Best and fast are both from the same checkpoint.

You have to use convert_to_int with stop_training to convert the model from
floating point to integer.

Please see
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#lstmtraining-command-line

for the exact syntax.

Since digits traineddata is not adding any characters, you will probably
need fewer iterations.

I had created this traineddata in response to a post in the forum and had
used number formats in training text and font similar to the sample image
provided.
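
As a rough sketch of the conversion step mentioned above (the wiki page linked above remains the reference for the exact syntax): the helper below only assembles an lstmtraining command line with --stop_training and --convert_to_int. The file paths are hypothetical placeholders.

```python
import shlex


def build_convert_to_int_command(checkpoint, base_traineddata, output_traineddata):
    """Assemble an lstmtraining call that writes an integer model from a checkpoint."""
    return [
        "lstmtraining",
        "--stop_training",      # finalize the model instead of continuing training
        "--convert_to_int",     # convert the floating point weights to integer
        "--continue_from", checkpoint,
        "--traineddata", base_traineddata,
        "--model_output", output_traineddata,
    ]


if __name__ == "__main__":
    # Hypothetical paths; adjust to the locations on your own system.
    command = build_convert_to_int_command(
        "output/digits_checkpoint",
        "train/eng/eng.traineddata",
        "output/digits_fast.traineddata",
    )
    print(shlex.join(command))
```

Leaving out --convert_to_int in the same call would keep the floating point model.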



On 04-Jan-2018 11:54 PM, "Thomas Menguy"  wrote:

> Thanks! Really great you took the time, very much appreciated. With that
> level of information we'll be able to find our way :)
>
> For your set which fonts did you use? (You have a best and a fast one)
>
> Thanks again
> Thomas
>
> Sent from my iPhone
>
> On 4 Jan 2018, at 17:19, ShreeDevi Kumar wrote:
>
> I am attaching a zip file.
>
> The files in langdata/eng are my modified version of training text and
> input files for punctuation and number formats. You can modify them further
> to match your requirements.
>
> I could not find a saved script with the command I used. Instead please
> see attached engtrain.sh - it was posted by one of users in the forum. You
> will need to modify it based on the file locations on your system. If you
> know the font used in the images you need to ocr, you can train with just
> that font/similar fonts.
>
>
>
> ShreeDevi
> 
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Thu, Jan 4, 2018 at 7:23 PM, Thomas Menguy 
> wrote:
>
>> Thanks a lot, seen the tutorial but was a bit confused as it is made to «
>> remove » characters to let only the digits, but was not sure which chars to
>> be removed ...(the whole Unicode minus the digits?) ...
>> Anyway thanks again for the answer ... would be awesome if you could find
>> back the command line ;)
>> BR
>>
>> Sent from my iPhone
>>
>> On 4 Jan 2018, at 10:08, ShreeDevi Kumar wrote:
>>
>> I will have to look for the exact commands and training text I used at
>> that time.
>>
>> You should be able to recreate the training by following instructions
>> given at
>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters
>>
>> I had modified the english langdata files and then finally renamed the
>> traineddata to digits after completing training.
>>
>> Create a training text which has digits and signs.
>>
>> Replace the word list to match the kind of number patterns you expect or
>> don't use a word list at all.
>>
>>
>>
>> On 04-Jan-2018 12:04 PM, "Thomas Menguy"  wrote:
>>
>> Hi Shree,
>>
>> Tried your Data for digits ... really works well!
>> Need to do a training set with number and signs for example ... could you
>> point me on how you've done your own training data (sorry fairly new to
>> Tesseract, never trained it before)
>>
>> Thanks for your help!
>> BR
>>
>> On Tuesday, October 3, 2017 at 6:39:30 PM UTC+2, shree wrote:
>>>
>>> You can try the plus-minus type of training if you just want a digits
>>> type of traineddata.
>>>
>>> Your training_text can contain numbers in the format you need and you
>>> can train with a font matching your images.
>>>
>>> For proof of concept you can try my experimental version at
>>>
>>> https://github.com/Shreeshrii/tessdata4alpha/blob/master/fast/digits.traineddata
>>>
>>> On Friday, September 29, 2017 at 12:32:41 PM UTC+5:30, John Miller wrote:

 Today,I found that the problem had been  posted on
 https://github.com/tesseract-ocr/tesseract/issues/751


Re: [tesseract-ocr] Re: How to use tesseract4.0 to only recognize the digits??

2018-01-04 Thread Thomas Menguy
Thanks! Really great you took the time, very much appreciated. With that level 
of information we'll be able to find our way :)

For your set which fonts did you use? (You have a best and a fast one)
 
Thanks again
Thomas

Sent from my iPhone

> On 4 Jan 2018, at 17:19, ShreeDevi Kumar wrote:
> 
> I am attaching a zip file.
> 
> The files in langdata/eng are my modified version of training text and input 
> files for punctuation and number formats. You can modify them further to 
> match your requirements.
> 
> I could not find a saved script with the command I used. Instead please see 
> attached engtrain.sh - it was posted by one of users in the forum. You will 
> need to modify it based on the file locations on your system. If you know the 
> font used in the images you need to ocr, you can train with just that 
> font/similar fonts.
> 
> 
> 
> ShreeDevi
> 
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
> 
>> On Thu, Jan 4, 2018 at 7:23 PM, Thomas Menguy  
>> wrote:
>> Thanks a lot, seen the tutorial but was a bit confused as it is made to « 
>> remove » characters to let only the digits, but was not sure which chars to 
>> be removed ...(the whole Unicode minus the digits?) ...
>> Anyway thanks again for the answer ... would be awesome if you could find 
>> back the command line ;)
>> BR
>> 
>> Sent from my iPhone
>> 
>>> On 4 Jan 2018, at 10:08, ShreeDevi Kumar wrote:
>>> 
>>> I will have to look for the exact commands and training text I used at that 
>>> time.
>>> 
>>> You should be able to recreate the training by following instructions given 
>>> at 
>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters
>>> 
>>> I had modified the english langdata files and then finally renamed the 
>>> traineddata to digits after completing training.
>>> 
>>> Create a training text which has digits and signs. 
>>> 
>>> Replace the word list to match the kind of number patterns you expect or 
>>> don't use a word list at all.
>>> 
>>> 
>>> 
>>> On 04-Jan-2018 12:04 PM, "Thomas Menguy"  wrote:
>>> Hi Shree, 
>>> 
>>> Tried your Data for digits ... really works well!
>>> Need to do a training set with number and signs for example ... could you 
>>> point me on how you've done your own training data (sorry fairly new to 
>>> Tesseract, never trained it before)
>>> 
>>> Thanks for your help!
>>> BR
>>> 
 On Tuesday, October 3, 2017 at 6:39:30 PM UTC+2, shree wrote:
 You can try the plus-minus type of training if you just want a digits type 
 of traineddata.
 
 Your training_text can contain numbers in the format you need and you can 
 train with a font matching your images.
 
 For proof of concept you can try my experimental version at 
 
 https://github.com/Shreeshrii/tessdata4alpha/blob/master/fast/digits.traineddata
 
> On Friday, September 29, 2017 at 12:32:41 PM UTC+5:30, John Miller wrote:
> Today,I found that the problem had been  posted on 
> https://github.com/tesseract-ocr/tesseract/issues/751
>>> 

[tesseract-ocr] Re: how to use PDF as Input

2018-01-04 Thread Quan Nguyen
You can specify a .uzn file defining the zones.

https://groups.google.com/forum/#!topic/tesseract-ocr/M0o5az7Zoo8
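
A minimal sketch of producing such a zone file for a form. It assumes the .uzn layout is one zone per line as "left top width height label" in pixel units, and that the file sits next to the image with the same base name; the linked thread is the place to confirm both points and how Tesseract is told to use the zones. The field names and coordinates are made up.

```python
from typing import NamedTuple


class Zone(NamedTuple):
    left: int
    top: int
    width: int
    height: int
    label: str


def format_uzn(zones):
    """Render zones as text, one 'left top width height label' line per zone."""
    lines = []
    for zone in zones:
        # Assumption: labels should not contain whitespace, so replace it.
        label = "_".join(zone.label.split())
        lines.append(f"{zone.left} {zone.top} {zone.width} {zone.height} {label}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical form fields and pixel coordinates.
    form_zones = [
        Zone(50, 40, 400, 60, "Name"),
        Zone(50, 120, 400, 120, "Address"),
    ]
    # For an image called form.png the text would be saved as form.uzn.
    print(format_uzn(form_zones), end="")
```

The labels only document which field each rectangle covers; mapping the recognized text back to Name, Address and so on is left to the calling code.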

On Thursday, January 4, 2018 at 7:37:48 AM UTC-6, Subhanshu Gupta wrote:
>
> Thanks Quan. One more thing, how can I use Tesseract to read a form having 
> different data fields like Name, Address, etc. and save the corresponding 
> data to somewhere else?
>
>
> On Thursday, January 4, 2018 at 6:51:48 AM UTC+5:30, Quan Nguyen wrote:
>>
>> Tesseract engine cannot read PDF. You'll have to convert them to suitable 
>> images (TIFF or PNG) first. There are many tools for that: ImageMagick, 
>> GhostScript, PDFBox, etc.
>>
>> On Wednesday, January 3, 2018 at 12:05:12 PM UTC-6, Subhanshu Gupta wrote:
>>>
>>> Dear All,
>>>
>>> I am new to Tesseract OCR and need to implement it to Read PDF Forms but 
>>> I am not able to find any good documentation for which method to use to 
>>> read PDF as well as for Character Segmentation.
>>> If any of you have any doc/manual relating on which method is used where 
>>> it will be really very helpful.
>>>
>>> Thanks. :)
>>>
>>



Re: [tesseract-ocr] Re: How to use tesseract4.0 to only recognize the digits??

2018-01-04 Thread ShreeDevi Kumar
Yes, I had made training text with just digits.

Basically, this cuts down on the unicharset in the traineddata to digits.
It finetunes the existing best model to the chosen subset of characters and
does not require too many iterations.
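
Since the character set ends up being whatever the training text contains, a quick check before training can catch stray characters. This is only a sketch with hypothetical helper names; it inspects the text itself and does not read or write any Tesseract file.

```python
def training_charset(text):
    """Return the set of non-whitespace characters that occur in the training text."""
    return {ch for ch in text if not ch.isspace()}


def unexpected_characters(text, allowed):
    """Return characters in the training text that are outside the allowed set."""
    return training_charset(text) - set(allowed)


if __name__ == "__main__":
    sample_text = "12 345\n6789 0\n42a\n"   # the 'a' is a deliberate stray character
    print(sorted(unexpected_characters(sample_text, "0123456789")))
```

An empty result means the text uses only the intended subset of characters.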

On 04-Jan-2018 7:23 PM, "Thomas Menguy"  wrote:

> Thanks a lot, seen the tutorial but was a bit confused as it is made to «
> remove » characters to let only the digits, but was not sure which chars to
> be removed ...(the whole Unicode minus the digits?) ...
> Anyway thanks again for the answer ... would be awesome if you could find
> back the command line ;)
> BR
>
> Sent from my iPhone
>
> On 4 Jan 2018, at 10:08, ShreeDevi Kumar wrote:
>
> I will have to look for the exact commands and training text I used at
> that time.
>
> You should be able to recreate the training by following instructions
> given at
> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters
>
> I had modified the english langdata files and then finally renamed the
> traineddata to digits after completing training.
>
> Create a training text which has digits and signs.
>
> Replace the word list to match the kind of number patterns you expect or
> don't use a word list at all.
>
>
>
> On 04-Jan-2018 12:04 PM, "Thomas Menguy"  wrote:
>
> Hi Shree,
>
> Tried your Data for digits ... really works well!
> Need to do a training set with number and signs for example ... could you
> point me on how you've done your own training data (sorry fairly new to
> Tesseract, never trained it before)
>
> Thanks for your help!
> BR
>
> On Tuesday, October 3, 2017 at 6:39:30 PM UTC+2, shree wrote:
>>
>> You can try the plus-minus type of training if you just want a digits
>> type of traineddata.
>>
>> Your training_text can contain numbers in the format you need and you can
>> train with a font matching your images.
>>
>> For proof of concept you can try my experimental version at
>>
>> https://github.com/Shreeshrii/tessdata4alpha/blob/master/fast/digits.traineddata
>>
>> On Friday, September 29, 2017 at 12:32:41 PM UTC+5:30, John Miller wrote:
>>>
>>> Today,I found that the problem had been  posted on
>>> https://github.com/tesseract-ocr/tesseract/issues/751
>>>

Re: [tesseract-ocr] Re: How to use tesseract4.0 to only recognize the digits??

2018-01-04 Thread Thomas Menguy
Thanks a lot. I had seen the tutorial but was a bit confused, as it is written 
to « remove » characters so that only the digits remain, and I was not sure 
which characters should be removed (the whole of Unicode minus the digits?).
Anyway, thanks again for the answer. It would be awesome if you could find 
the command line again ;)
BR

Sent from my iPhone

> On 4 Jan 2018, at 10:08, ShreeDevi Kumar wrote:
> 
> I will have to look for the exact commands and training text I used at that 
> time.
> 
> You should be able to recreate the training by following instructions given 
> at 
> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters
> 
> I had modified the english langdata files and then finally renamed the 
> traineddata to digits after completing training.
> 
> Create a training text which has digits and signs. 
> 
> Replace the word list to match the kind of number patterns you expect or 
> don't use a word list at all.
> 
> 
> 
> On 04-Jan-2018 12:04 PM, "Thomas Menguy"  wrote:
> Hi Shree, 
> 
> Tried your Data for digits ... really works well!
> Need to do a training set with number and signs for example ... could you 
> point me on how you've done your own training data (sorry fairly new to 
> Tesseract, never trained it before)
> 
> Thanks for your help!
> BR
> 
>> On Tuesday, October 3, 2017 at 6:39:30 PM UTC+2, shree wrote:
>> You can try the plus-minus type of training if you just want a digits type 
>> of traineddata.
>> 
>> Your training_text can contain numbers in the format you need and you can 
>> train with a font matching your images.
>> 
>> For proof of concept you can try my experimental version at 
>> 
>> https://github.com/Shreeshrii/tessdata4alpha/blob/master/fast/digits.traineddata
>> 
>>> On Friday, September 29, 2017 at 12:32:41 PM UTC+5:30, John Miller wrote:
>>> Today,I found that the problem had been  posted on 
>>> https://github.com/tesseract-ocr/tesseract/issues/751
> 


[tesseract-ocr] Re: how to use PDF as Input

2018-01-04 Thread Subhanshu Gupta
Thanks Quan. One more thing: how can I use Tesseract to read a form that has 
different data fields like Name, Address, etc., and save the corresponding 
data somewhere else?


On Thursday, January 4, 2018 at 6:51:48 AM UTC+5:30, Quan Nguyen wrote:
>
> Tesseract engine cannot read PDF. You'll have to convert them to suitable 
> images (TIFF or PNG) first. There are many tools for that: ImageMagick, 
> GhostScript, PDFBox, etc.
>
> On Wednesday, January 3, 2018 at 12:05:12 PM UTC-6, Subhanshu Gupta wrote:
>>
>> Dear All,
>>
>> I am new to Tesseract OCR and need to implement it to Read PDF Forms but 
>> I am not able to find any good documentation for which method to use to 
>> read PDF as well as for Character Segmentation.
>> If any of you have any doc/manual relating on which method is used where 
>> it will be really very helpful.
>>
>> Thanks. :)
>>
>
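
The conversion step suggested in the quoted reply could look roughly like this, using Ghostscript as one of the tools named there. It is a sketch under assumptions: the 300 dpi resolution, the png16m device and the file names are my choices, not something stated in the thread. The helpers only build the command lines; run them with subprocess.run once the tools are installed.

```python
import shlex


def build_ghostscript_command(pdf_path, output_pattern="page-%03d.png", dpi=300):
    """Assemble a Ghostscript call that renders each PDF page to a PNG image."""
    return [
        "gs",
        "-dNOPAUSE", "-dBATCH", "-dSAFER",   # run non-interactively and exit when done
        "-sDEVICE=png16m",                   # 24-bit colour PNG output
        f"-r{dpi}",                          # rendering resolution in dpi
        f"-sOutputFile={output_pattern}",    # %03d expands to the page number
        pdf_path,
    ]


def build_tesseract_command(image_path, output_base):
    """Assemble a plain Tesseract call; the text lands in <output_base>.txt."""
    return ["tesseract", image_path, output_base]


if __name__ == "__main__":
    # Hypothetical file names for illustration.
    print(shlex.join(build_ghostscript_command("form.pdf")))
    print(shlex.join(build_tesseract_command("page-001.png", "page-001")))
```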



[tesseract-ocr] Re: Need Help with extracting info from Invoice

2018-01-04 Thread Ha Hien

Hi Djibril,
I am afraid this is an old topic, and he may not be working with invoices 
anymore. I am also interested in extracting information from invoices. Have 
you tried using Tesseract with a dictionary to improve accuracy? Invoices 
have some particular data fields, so this may help. You can see the manual 
here: 
https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#config-files-and-augmenting-with-user-data
Let me know if you get better results; I will do the same.
Best,
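
A small sketch of the dictionary idea, assuming the --user-words option described in the manual section linked above. The word list and file names are hypothetical, and whether the extra words actually improve results depends on the engine and version, so it is worth comparing the output with and without them.

```python
import shlex


def format_word_list(words):
    """Return a user-words file body: one word per line, blanks and repeats dropped."""
    seen = []
    for word in words:
        word = word.strip()
        if word and word not in seen:
            seen.append(word)
    return "\n".join(seen) + "\n"


def build_user_words_command(image_path, output_base, words_file):
    """Assemble a Tesseract call that points at a user word list."""
    return ["tesseract", image_path, output_base, "--user-words", words_file]


if __name__ == "__main__":
    # Hypothetical field names taken from the kind of invoice discussed in this thread.
    print(format_word_list(["Invoice", "Mileage", "Invoice", " Advisor "]), end="")
    print(shlex.join(build_user_words_command("invoice.png", "invoice", "invoice.user-words")))
```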

On Wednesday, December 6, 2017 at 20:32:50 UTC+1, Djibril Kaba wrote:
>
> Hi Vinay,
>
> I am trying to solve the same problem here. Have you managed to get some 
> solution to your problem. Your help would be greatly appreciated.  Looking 
> forward to hearing from you.
>
> Many thanks!!
>
> On Tuesday, November 18, 2014 at 8:53:08 PM UTC+1, Vinay Matam wrote:
>>
>> Hi All,
>>
>> I really need your help with one of the projects that I am working on. I 
>> am using Tesseract 3.02 on a Ubuntu machine.
>>
>> I have an invoice (please see the attached file). I want to extract some 
>> information from that invoice like Advisor Name, Invoice Number, Invoice 
>> Date, License No, Mileage etc..
>>
>> I have tried to extract the whole data from the image to a text file. By 
>> doing some pre-processing on the image using Imagemagick, I was able to 
>> extract the info to some extent. However, I am not totally satisfied with 
>> the output. 
>> I need your inputs on how I should extract the information. Shall I first 
>> crop the specific portion of the image to different rectangles and then OCR 
>> them individually..? I tried this way and gained great results. But again 
>> in this case, not all the images are in the same size with same resolution 
>> and hence the rectangles co-ordinates will not work on all the cases. I 
>> thought this method will not work on all images (scanned, taken from mobile 
>> or pdf files).
>>
>> Then I thought of using Regular expressions on the extracted data and 
>> then pick up the data that I require from the whole text file. But this 
>> method also does not seem to be working. 
>>
>> I am totally in a confused state now. Any help or inputs are much 
>> appreciated. .. :) I have attached a sample image and the extracted output.
>>
>> Thanks,
>> Vinay.
>>
>



Re: [tesseract-ocr] Re: How to use tesseract4.0 to only recognize the digits??

2018-01-04 Thread ShreeDevi Kumar
I will have to look for the exact commands and training text I used at that
time.

You should be able to recreate the training by following instructions given
at
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters

I had modified the english langdata files and then finally renamed the
traineddata to digits after completing training.

Create a training text which has digits and signs.

Replace the word list to match the kind of number patterns you expect or
don't use a word list at all.
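
To illustrate what a training text with digits and signs might look like, here is a small generator. It is not the text I used; the number formats (optional sign, optional two decimals), the token count per line and the seed are assumptions to replace with the patterns you expect in your images.

```python
import random


def make_number_line(rng, count=8):
    """Build one line of space-separated numbers with optional signs and decimals."""
    tokens = []
    for _ in range(count):
        sign = rng.choice(["", "+", "-"])
        whole = rng.randint(0, 99999)
        if rng.random() < 0.5:
            tokens.append(f"{sign}{whole}.{rng.randint(0, 99):02d}")
        else:
            tokens.append(f"{sign}{whole}")
    return " ".join(tokens)


def make_training_text(lines=100, seed=0):
    """Return a reproducible multi-line training text made only of digits and signs."""
    rng = random.Random(seed)
    return "\n".join(make_number_line(rng) for _ in range(lines)) + "\n"


if __name__ == "__main__":
    print(make_training_text(lines=3, seed=1), end="")
```

The same generator can produce a word list of number patterns by writing one token per line instead of several.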



On 04-Jan-2018 12:04 PM, "Thomas Menguy"  wrote:

Hi Shree,

I tried your data for digits ... it really works well!
I need to build a training set with numbers and signs, for example ... could 
you point me to how you built your own training data? (Sorry, I am fairly new 
to Tesseract and have never trained it before.)

Thanks for your help!
BR

On Tuesday, October 3, 2017 at 6:39:30 PM UTC+2, shree wrote:
>
> You can try the plus-minus type of training if you just want a digits type
> of traineddata.
>
> Your training_text can contain numbers in the format you need and you can
> train with a font matching your images.
>
> For proof of concept you can try my experimental version at
>
> https://github.com/Shreeshrii/tessdata4alpha/blob/master/fast/digits.traineddata
>
> On Friday, September 29, 2017 at 12:32:41 PM UTC+5:30, John Miller wrote:
>>
>> Today,I found that the problem had been  posted on
>> https://github.com/tesseract-ocr/tesseract/issues/751
>>


[tesseract-ocr] tesseract 4 (last) box format

2018-01-04 Thread Amit Man
I understand that for the new Tesseract 4, the box format should include a 
space after each word. 
Regarding this passage:

> 3.0 version of box files can be converted for use with LSTM training by 
> adding a tab character at end of each line and boxes with space after each 
> word. Mark EOL and Mark EOL Bulk functions under Edit in Box Editor tab of 
> latest version of jTessBoxEditor - jTessBoxEditor-2.0-Beta can be used to 
> add the EOL tabs automatically. Insert mode can be used on last letter of 
> each word to add a box with space. There is no automated way to do this.


Are the coordinates of those manually inserted "space" lines important, or 
are they just markers?
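
To make the question concrete, here is a sketch of the kind of lines I mean, assuming the usual "symbol left bottom right top page" box layout. It puts a space box between words and a tab box at the end of the line. Both marker boxes simply reuse the coordinates of the preceding glyph, which is a placeholder choice and exactly the part I am unsure about; whether a space box is also expected after the last word is not settled by the quoted text either.

```python
from typing import NamedTuple


class Box(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int


def lstm_box_lines(words, page=0):
    """Emit box lines for one text line: glyph boxes, a space box between words, a tab box at the end."""
    out = []
    for index, word in enumerate(words):
        for box in word:
            out.append(f"{box.char} {box.left} {box.bottom} {box.right} {box.top} {page}")
        last = word[-1]
        # Placeholder: the marker box reuses the coordinates of the word's last glyph.
        marker = "\t" if index == len(words) - 1 else " "
        out.append(f"{marker} {last.left} {last.bottom} {last.right} {last.top} {page}")
    return out


if __name__ == "__main__":
    # Hypothetical glyph boxes for the two words "Hi" and "x".
    line = [
        [Box("H", 10, 20, 18, 40), Box("i", 19, 20, 22, 40)],
        [Box("x", 30, 20, 38, 34)],
    ]
    for entry in lstm_box_lines(line):
        print(repr(entry))
```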
