Re: [tesseract-ocr] Form Recognizer using Ocr

2019-10-17 Thread Shree Devi Kumar
You can try with uzn files. See https://jsoma.github.io/kull/#/
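
For illustration, a zone file can be generated from known field boxes. This is a minimal sketch, assuming the UZN layout is one zone per line as `left top width height label` (pixels, measured from the top-left corner), and assuming Tesseract picks up `form.uzn` placed next to `form.png` when run with a page segmentation mode that skips layout analysis, such as `--psm 4`. The field names and coordinates are made up.

```python
from pathlib import Path

# Hypothetical field boxes for one form type:
# field name -> (left, top, width, height) in pixels.
FIELDS = {
    "applicant_name": (120, 210, 640, 48),
    "date_of_birth": (120, 290, 320, 48),
}


def to_uzn(fields):
    """Render field boxes as UZN text, one 'left top width height label' line per zone."""
    lines = []
    for label, (left, top, width, height) in fields.items():
        # Labels are whitespace-delimited in the file, so replace spaces defensively.
        lines.append(f"{left} {top} {width} {height} {label.replace(' ', '_')}")
    return "\n".join(lines) + "\n"


def write_uzn(image_path, fields):
    """Write the zone file next to the image, using the same base name."""
    uzn_path = Path(image_path).with_suffix(".uzn")
    uzn_path.write_text(to_uzn(fields), encoding="ascii")
    return uzn_path
```

With `form.uzn` written this way, a command along the lines of `tesseract form.png out --psm 4` should OCR only the listed zones; it is worth checking this against the Tesseract version in use.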

On Fri, Oct 18, 2019 at 11:03 AM Rahul Dochak 
wrote:

> Hi All,
>
> I have a task and can see a way to approach it, but I do not know how to
> implement it. This is what I am trying to do:
>
> I want to build a form recogniser and then extract text from the fields
> inside the forms. The forms are scanned PDFs, and I do not know the forms
> or their fields beforehand; I only know the form name.
>
> I want to scan the PDF, convert it to text, search for the form name, and
> check whether I have a predefined template for that form type. If not, I
> have to somehow get the locations of all the fields, since I do not have
> the required fields for that form type, then make a template for future
> use with the same form type and extract the field data to JSON. I could
> not find a way to make a template on the go for a new form type. Guidance
> in the right direction would be helpful.
>
> Thanks in advance.
> Rahul.
>
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/6edb4f1a-c44c-4f9c-b929-f3079b223eb6%40googlegroups.com.
>


-- 


भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduWgKkfbR2SG-3hZ-DFYVy2WKR_EoH90N65iUqJOUA1PNg%40mail.gmail.com.


[tesseract-ocr] Form Recognizer using Ocr

2019-10-17 Thread Rahul Dochak
Hi All,

I have a task and can see a way to approach it, but I do not know how to
implement it. This is what I am trying to do:

I want to build a form recogniser and then extract text from the fields
inside the forms. The forms are scanned PDFs, and I do not know the forms
or their fields beforehand; I only know the form name.

I want to scan the PDF, convert it to text, search for the form name, and
check whether I have a predefined template for that form type. If not, I
have to somehow get the locations of all the fields, since I do not have
the required fields for that form type, then make a template for future
use with the same form type and extract the field data to JSON. I could
not find a way to make a template on the go for a new form type. Guidance
in the right direction would be helpful.

Thanks in advance.
Rahul.
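
The flow described in this message can be sketched roughly as follows. This is only an outline under assumptions: `TEMPLATES`, `find_form_type`, and `extract_to_json` are hypothetical names, `ocr_zone` stands in for whatever OCR call reads one box (for example Tesseract restricted to a region), and the step that discovers field locations for an unknown form is left as a placeholder because the thread does not settle how to do it.

```python
import json

# Hypothetical template store:
# form name -> {field name: (left, top, width, height) in pixels}.
TEMPLATES = {
    "Leave Application": {
        "employee": (100, 200, 400, 40),
        "days": (100, 260, 120, 40),
    },
}


def find_form_type(page_text, templates):
    """Return the first known form name found in the OCR'd page text, else None."""
    lowered = page_text.lower()
    for form_name in templates:
        if form_name.lower() in lowered:
            return form_name
    return None


def extract_to_json(page_text, templates, ocr_zone):
    """Look up the template for the page and OCR each field box into a JSON string.

    ocr_zone is a caller-supplied function (box -> text); it is an assumption here.
    """
    form_name = find_form_type(page_text, templates)
    if form_name is None:
        # Unknown form: a new template would have to be built here (not shown).
        return json.dumps({"form": None, "fields": {}})
    fields = {
        name: ocr_zone(box).strip()
        for name, box in templates[form_name].items()
    }
    return json.dumps({"form": form_name, "fields": fields}, sort_keys=True)
```

Matching on the form name is case-insensitive because OCR output often differs in capitalisation from the stored template name.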




Re: [tesseract-ocr] tesseract data language model sources

2019-10-17 Thread 'abram stern' via tesseract-ocr
thanks, this is exactly what I was looking for! -a

On Thu, Oct 17, 2019 at 9:10 PM Shree Devi Kumar 
wrote:

> See
> https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951
>
>
> On Fri, Oct 18, 2019 at 9:10 AM 'abram stern' via tesseract-ocr <
> tesseract-ocr@googlegroups.com> wrote:
>
>> Hi tesseract community,
>>
>> I'm working on a research project about OCR and I'm wondering where the
>> included data models (e.g. 'fast', 'best') come from -- or put another way,
>> what source material is used for training them?  I haven't been able to
>> find this documented anywhere and am interested to know if it involves
>> public domain corpora, data obtained through book scanning, or other
>> sources.
>>
>> Best regards,
>> Abram


-- 
Abram Stern (aphid)
PhD Candidate, Film and Digital Media
University of California, Santa Cruz
ap...@ucsc.edu // a...@aphid.org ⚛ // (831) 224-0334 (mobile/signal)



Re: [tesseract-ocr] tesseract data language model sources

2019-10-17 Thread Shree Devi Kumar
https://github.com/tesseract-ocr/langdata_lstm
has the files used.

On Fri, Oct 18, 2019 at 9:39 AM Shree Devi Kumar 
wrote:

> See
> https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951
>
>
> On Fri, Oct 18, 2019 at 9:10 AM 'abram stern' via tesseract-ocr <
> tesseract-ocr@googlegroups.com> wrote:
>
>> Hi tesseract community,
>>
>> I'm working on a research project about OCR and I'm wondering where the
>> included data models (e.g. 'fast', 'best') come from -- or put another way,
>> what source material is used for training them?  I haven't been able to
>> find this documented anywhere and am interested to know if it involves
>> public domain corpora, data obtained through book scanning, or other
>> sources.
>>
>> Best regards,
>> Abram


-- 


भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com



Re: [tesseract-ocr] tesseract data language model sources

2019-10-17 Thread Shree Devi Kumar
See
https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951


On Fri, Oct 18, 2019 at 9:10 AM 'abram stern' via tesseract-ocr <
tesseract-ocr@googlegroups.com> wrote:

> Hi tesseract community,
>
> I'm working on a research project about OCR and I'm wondering where the
> included data models (e.g. 'fast', 'best') come from -- or put another way,
> what source material is used for training them?  I haven't been able to
> find this documented anywhere and am interested to know if it involves
> public domain corpora, data obtained through book scanning, or other
> sources.
>
> Best regards,
> Abram


-- 


भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com



[tesseract-ocr] tesseract data language model sources

2019-10-17 Thread 'abram stern' via tesseract-ocr
Hi tesseract community,

I'm working on a research project about OCR and I'm wondering where the 
included data models (e.g. 'fast', 'best') come from -- or put another way, 
what source material is used for training them?  I haven't been able to 
find this documented anywhere and am interested to know if it involves 
public domain corpora, data obtained through book scanning, or other 
sources.

Best regards,
Abram



[tesseract-ocr] Re: java.lang.UnsatisfiedLinkError in Windows Server 2008

2019-10-17 Thread Quan Nguyen
Can you try with version 4.3.1 or the latest version 4.4.1?

On Tuesday, October 15, 2019 at 10:56:14 AM UTC-5, Nuno Feliciano wrote:
>
> Hi,
>
> I am getting an error with Tess4j when I run it in a Windows Server 2008 
> R2 64 bit (tess4j-4.3.0).
>
> Exception in thread "main" java.lang.UnsatisfiedLinkError: 
> at com.sun.jna.Native.open(Native Method)
> at com.sun.jna.Native.open(Native.java:1759)
> at com.sun.jna.NativeLibrary.loadLibrary(NativeLibrary.java:260)
> at com.sun.jna.NativeLibrary.getInstance(NativeLibrary.java:398)
> at com.sun.jna.Library$Handler.<init>(Library.java:147)
> at com.sun.jna.Native.loadLibrary(Native.java:412)
> at com.sun.jna.Native.loadLibrary(Native.java:391)
> at net.sourceforge.tess4j.util.LoadLibs.getTessAPIInstance(LoadLibs.java:85)
> at net.sourceforge.tess4j.TessAPI.<clinit>(TessAPI.java:42)
> at net.sourceforge.tess4j.Tesseract.init(Tesseract.java:426)
> at net.sourceforge.tess4j.Tesseract.getWords(Tesseract.java:693)
>
> I am using JDK 1.8, 64-bit.
> I don't get the error on Windows 8.1 Enterprise 64-bit.
>
> I have tried using Dependency Walker to figure out which DLLs I was
> missing. I tried adding a few (dcomp, vcruntime140, msvcp140 and a few more),
> but no luck.
>
> Can anyone help?
>



Re: [tesseract-ocr] Tesseract Strangely Thinks Text is Upside Down - ACCURACY

2019-10-17 Thread Lorenzo Bolzani
Maybe a problem with the exif rotation data?

On Thu, Oct 17, 2019 at 8:12 PM Umut Barış Korkut <
umut.kor...@gamyte.com> wrote:

> Default psm works with these two pages but it does not work with the other
> pages of the document because they have tables and vertical text.
>
> Is it possible to give the orientation of the page to Tesseract, or is it
> possible to disable detection of upside-down text?
>
>
> On Saturday, October 12, 2019 at 2:27:29 PM UTC+3, zdenop wrote:
>>
>> Do not use psm 12. Default psm seems to work.
>>
>> Zdenko
>>
>>
>> On Thu, Oct 10, 2019 at 12:47 Umut Barış Korkut
>> wrote:
>>
>>> Hey,
>>>
>>> Tesseract sometimes thinks all the text on the page is upside down.
>>> For example, the text "MOM" is recognized as "WOW" by Tesseract.
>>> Similarly, "GENERAL NOTES" is recognized as "SALON IWYANSAD".
>>>
>>> How can I fix this? Are there any suggestions?
>>>
>>> I have attached 2 similar images here, tesseract is very successful on
>>> one of them and extremely awful on the other one.
>>>
>>> https://drive.google.com/drive/folders/1F328jOAK6fUPX1a79fy9JReAoHgpwQsB?usp=sharing
>>>
>>> I tried with both tesseract 4 and 5 using psm 12 mode.



Re: [tesseract-ocr] Tesseract Strangely Thinks Text is Upside Down - ACCURACY

2019-10-17 Thread Umut Barış Korkut
Default psm works with these two pages but it does not work with the other 
pages of the document because they have tables and vertical text.

Is it possible to give the orientation of the page to Tesseract, or is it 
possible to disable detection of upside-down text?
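
One way to take control of orientation is to detect it in a separate pass and rotate the image yourself before OCR. The sketch below only parses the text report that a command such as `tesseract image.png stdout --psm 0` prints (lines such as `Rotate: 180`). The sample report is made up, `parse_osd` and `rotation_to_apply` are hypothetical helpers, the confidence threshold is arbitrary, and whether orientation detection is reliable on these pages is untested.

```python
import re

# Made-up sample of an orientation and script detection (OSD) report.
SAMPLE_OSD = """Page number: 0
Orientation in degrees: 180
Rotate: 180
Orientation confidence: 7.27
Script: Latin
Script confidence: 2.00
"""


def parse_osd(report):
    """Return (rotate_degrees, orientation_confidence) parsed from an OSD report."""
    rotate = re.search(r"^Rotate:\s*(\d+)", report, re.MULTILINE)
    conf = re.search(r"^Orientation confidence:\s*([\d.]+)", report, re.MULTILINE)
    if rotate is None or conf is None:
        raise ValueError("not an OSD report")
    return int(rotate.group(1)), float(conf.group(1))


def rotation_to_apply(report, min_confidence=5.0):
    """Rotate only when detection is confident; otherwise leave the page as it is.

    min_confidence is an arbitrary illustrative threshold, not a recommended value.
    """
    degrees, confidence = parse_osd(report)
    return degrees if confidence >= min_confidence else 0
```

If the page orientation is already known, the same idea applies without detection: rotate the image to upright first and pass the corrected image to Tesseract.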


On Saturday, October 12, 2019 at 2:27:29 PM UTC+3, zdenop wrote:
>
> Do not use psm 12. Default psm seems to work.
>
> Zdenko
>
>
> On Thu, Oct 10, 2019 at 12:47 Umut Barış Korkut wrote:
>
>> Hey,
>>
>> Tesseract sometimes thinks all the text on the page is upside down. 
>> For example, the text "MOM" is recognized as "WOW" by Tesseract. 
>> Similarly, "GENERAL NOTES" is recognized as "SALON IWYANSAD".
>>
>> How can I fix this? Are there any suggestions?
>>
>> I have attached 2 similar images here, tesseract is very successful on 
>> one of them and extremely awful on the other one.
>>
>> https://drive.google.com/drive/folders/1F328jOAK6fUPX1a79fy9JReAoHgpwQsB?usp=sharing
>>
>> I tried with both tesseract 4 and 5 using psm 12 mode.
