Re: [tesseract-ocr] Problems recognized mixed scripts in Tesseract 4 alpha

2017-08-31 Thread ShreeDevi Kumar
Have you tried the 'best' traineddata for Chinese, which was trained on
English in addition to Chinese? That may be a better option
than using eng+
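
To compare the two setups side by side, a small sketch like the following could build both command lines. It only assembles argument lists and runs nothing. The file names and the tessdata/best folder are hypothetical placeholders; `-l` and `--tessdata-dir` are the usual tesseract command-line options.

```python
from typing import List, Optional


def build_tesseract_cmd(image: str, outbase: str, langs: List[str],
                        tessdata_dir: Optional[str] = None) -> List[str]:
    """Return an argv list for the tesseract CLI (nothing is executed here)."""
    if not langs:
        raise ValueError("at least one language code is required")
    # Several languages are joined with '+', as in 'eng+chi_tra'.
    cmd = ["tesseract", image, outbase, "-l", "+".join(langs)]
    if tessdata_dir is not None:
        # Point tesseract at a specific folder of .traineddata files.
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


# Hypothetical file and folder names, used only for illustration.
combined = build_tesseract_cmd("page.png", "out_combined", ["eng", "chi_tra"])
best_only = build_tesseract_cmd("page.png", "out_best", ["chi_tra"],
                                tessdata_dir="tessdata/best")
print(" ".join(combined))
print(" ".join(best_only))
```

Passing each list to subprocess.run and diffing the two output files would show whether the single 'best' model handles the mixed page better than the combined languages.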

On 31-Aug-2017 12:31 PM, "Brendan O'Kane"  wrote:

> Hi all,
>
> Running 'tesseract -l eng+chi_tra' on a scanned page of English text mixed
> with Chinese characters does not detect any Chinese characters at all:
>
> > The five chapters on fiction, memoirs, and other kinds of prose that
> > follow offer as many approaches to our understanding of the transition
> > between 1644 and 1700. Focusing on the lives of Mao Xiang § X (161-
> > 93) and Yu Huai A1% (1616-96), Oki Yasushi develops portraits of these
> > two "romantic Jiangnan loyalists," who clung to patterns of late Ming
> > feeling and aestheticism long after the Ming had fallen. The image of
> > loyalism as romantic is in striking contrast to starker images of
> loyalist
> > experience. Both Mao and Yu are best known for their memoirs, which
> > focus prominently on women, one of the new ways of figuring nos-
> > talgia and resistance in male writings of the early Qing. Robert Hegel's
> > "Dreaming the Past" is similarly concerned with the individual, fo-
> > cusing on Chu Renhuo #ARE (ca. 1630-1705+), as well as his novel,
> > Sui Tang yany: G B® #&, (ca. 1675), but it extends well beyond Chu and
> > his work in contemplating how "the past" (the Tang past in particular)
> > shaped imaginative literature in an era when the present offered little
> > solace.
>
>
> The characters are (mostly) correctly recognized when only 'chi_tra' is
> set as the OCR language, but at the cost of seriously degraded accuracy in
> English OCR:
>
> > The fve chapters on fiction,menoirs, and other kinds of prose thar
> > follow offer as nany approaches to our understanding ofthe transition
> > between :644 and I7oo. Focusing on the |ives of Mao 文 iang 冒 裱 (I6II-
> > 93andYuTiuai 余 懷 ((616-96), OkiYasushidevelops portraits ofthese
> > two "ronantic Jiangnan loyalists"who clung to patterns of ]ate N{ing
> > feeling and aestheticismn long after the Ming had fallen. The of
> > loyalisn as ronantic is in striking contrast to starker 1nages of
> |oyalisr
> > experience. Both Mao and Yu are best known fortheir memolrs, wˇhich
> > focus Proninently on womnen, one of the new ways of figuring nox-
> > talgia and resistance in male writings of the early Cuing. Roberr
> Tiegel's
> > "1reaning the Past" is simnilarly concerned with the individual, fo-
> > cusing on ChuRenhuo 褚 人 穫 (ca. I63o-I7oy+)}, as well a$ his novel,
> > 5#77mzg5227 隋 唐 演 義 (Ca.I67y, butit extends well beyondChu and
> > his work in contemplating how "the past" (the Tang Past in Particulan
> > shaped imaginative ]iterature in an era when Lhe present offered |ittle
> > $olace.
>
>
> Is this a known issue? Am I doing something wrong here?
>
> --Brendan
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To post to this group, send email to tesseract-ocr@googlegroups.com.
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/0ed8e7da-72cb-4bb8-8f48-44f8fc76f7c2%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduURJpLOjw9ybZTTKJAxmaxQuJ01c12LNYYjx%3D2432avZQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Re: [tesseract-ocr] Re: error when make

2017-08-30 Thread ShreeDevi Kumar
See https://abi-laboratory.pro/tracker/timeline/tesseract/

and

https://github.com/tesseract-ocr/tesseract/issues/793

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Wed, Aug 30, 2017 at 7:27 AM, Carlos Miguens  wrote:

> Hi! I have the same problem here when I include imageprocessor.h
>
> ‘p1_’ was not declared in this scope sudoku line 1005, external location:
> /usr/local/include/tesseract/tesscallback.h C/C++ Problem
>
> Does anybody know what is happening? Thanks!
>
> On Friday, May 27, 2016, 8:23:53 (UTC-3), Dennis Park wrote:
>>
>>
>> It seems not every version works in my environment; 3.02.02 (with
>> tessdata-3.04.00) is working well for me.
>>
>>
>> Dennis
>>
>> On Wednesday, May 25, 2016 at 2:05:54 PM UTC+8, Dennis Park wrote:
>>>
>>> hi, guys:
>>>
>>> configure is ok, but when running make I got the following error.
>>> Any idea why?
>>>
>>> Thanks in advance.
>>> Dennis
>>>
>>>
>>> /bin/sh ../libtool  --tag=CXX   --mode=compile g++ -DHAVE_CONFIG_H -I.
>>> -I..  -O2 -DNDEBUG -DUSE_STD_NAMESPACE -I../ccutil -I../ccstruct
>>> -I../viewer -I../classify -I../dict -I../wordrec -I../cutil -I../textord
>>> -I../opencl   -I../neural_networks/runtime -I../cube
>>> -I/home/work/.jumbo/include -I/home/work/.jumbo/include/leptonica  -g
>>> -O2 -MT cubeclassifier.lo -MD -MP -MF .deps/cubeclassifier.Tpo -c -o
>>> cubeclassifier.lo cubeclassifier.cpp
>>> libtool: compile:  g++ -DHAVE_CONFIG_H -I. -I.. -O2 -DNDEBUG
>>> -DUSE_STD_NAMESPACE -I../ccutil -I../ccstruct -I../viewer -I../classify
>>> -I../dict -I../wordrec -I../cutil -I../textord -I../opencl
>>> -I../neural_networks/runtime -I../cube -I/home/work/.jumbo/include
>>> -I/home/work/.jumbo/include/leptonica -g -O2 -MT cubeclassifier.lo -MD
>>> -MP -MF .deps/cubeclassifier.Tpo -c cubeclassifier.cpp  -fPIC -DPIC -o
>>> .libs/cubeclassifier.o
>>> libtool: compile:  g++ -DHAVE_CONFIG_H -I. -I.. -O2 -DNDEBUG
>>> -DUSE_STD_NAMESPACE -I../ccutil -I../ccstruct -I../viewer -I../classify
>>> -I../dict -I../wordrec -I../cutil -I../textord -I../opencl
>>> -I../neural_networks/runtime -I../cube -I/home/work/.jumbo/include
>>> -I/home/work/.jumbo/include/leptonica -g -O2 -MT cubeclassifier.lo -MD
>>> -MP -MF .deps/cubeclassifier.Tpo -c cubeclassifier.cpp -o cubeclassifier.o
>>> >/dev/null 2>&1
>>> mv -f .deps/cubeclassifier.Tpo .deps/cubeclassifier.Plo
>>> /bin/sh ../libtool  --tag=CXX   --mode=compile g++ -DHAVE_CONFIG_H -I.
>>> -I..  -O2 -DNDEBUG -DUSE_STD_NAMESPACE -I../ccutil -I../ccstruct
>>> -I../viewer -I../classify -I../dict -I../wordrec -I../cutil -I../textord
>>> -I../opencl   -I../neural_networks/runtime -I../cube
>>> -I/home/work/.jumbo/include -I/home/work/.jumbo/include/leptonica  -g
>>> -O2 -MT tesseract_cube_combiner.lo -MD -MP -MF
>>> .deps/tesseract_cube_combiner.Tpo -c -o tesseract_cube_combiner.lo
>>> tesseract_cube_combiner.cpp
>>> libtool: compile:  g++ -DHAVE_CONFIG_H -I. -I.. -O2 -DNDEBUG
>>> -DUSE_STD_NAMESPACE -I../ccutil -I../ccstruct -I../viewer -I../classify
>>> -I../dict -I../wordrec -I../cutil -I../textord -I../opencl
>>> -I../neural_networks/runtime -I../cube -I/home/work/.jumbo/include
>>> -I/home/work/.jumbo/include/leptonica -g -O2 -MT
>>> tesseract_cube_combiner.lo -MD -MP -MF .deps/tesseract_cube_combiner.Tpo
>>> -c tesseract_cube_combiner.cpp  -fPIC -DPIC -o
>>> .libs/tesseract_cube_combiner.o
>>> libtool: compile:  g++ -DHAVE_CONFIG_H -I. -I.. -O2 -DNDEBUG
>>> -DUSE_STD_NAMESPACE -I../ccutil -I../ccstruct -I../viewer -I../classify
>>> -I../dict -I../wordrec -I../cutil -I../textord -I../opencl
>>> -I../neural_networks/runtime -I../cube -I/home/work/.jumbo/include
>>> -I/home/work/.jumbo/include/leptonica -g -O2 -MT
>>> tesseract_cube_combiner.lo -MD -MP -MF .deps/tesseract_cube_combiner.Tpo
>>> -c tesseract_cube_combiner.cpp -o tesseract_cube_combiner.o >/dev/null 2>&1
>>> mv -f .deps/tesseract_cube_combiner.Tpo .deps/tesseract_cube_combiner.
>>> Plo
>>> /bin/sh ../libtool  --tag=CXX   --mode=link g++  -g -O2
>>>  -L/home/work/.jumbo/lib -o libtesseract_main.la  adaptions.lo
>>> applybox.lo control.lo docqual.lo equationdetect.lo fixspace.lo fixxht.lo
>>> ltrresultiterator.lo osdetect.lo output.lo pageiterator.lo pagesegmain.lo
>>> pagewalk.lo par_control.lo paragraphs.lo paramsd.lo pgedit.lo
>>> recogtraining.lo reject.lo resultiterator.lo superscript.lo tessbox.lo
>>> tessedit.lo tesseractclass.lo tessvars.lo tfacepp.lo thresholder.lo
>>> werdit.lo cube_control.lo cube_reco_context.lo cubeclassifier.lo
>>> tesseract_cube_combiner.lo  -llept -lpthread
>>> libtool: link: ar cru .libs/libtesseract_main.a .libs/adaptions.o
>>> .libs/applybox.o .libs/control.o .libs/docqual.o .libs/equationdetect.o
>>> .libs/fixspace.o .libs/fixxht.o .libs/ltrresultiterator.o .libs/osdetect.o
>>> .libs/output.o .libs/pageiterator.o .libs/pagesegmain.o .libs/pagewalk.o
>>> .libs/par_control.o .libs/paragraphs.o .libs/paramsd.o .libs/pgedit.o
>>> 

Re: [tesseract-ocr] Spanish text better processed in eng than in spa

2017-08-29 Thread ShreeDevi Kumar
I have opened this as an issue at
https://github.com/tesseract-ocr/tessdata/issues/77

You can provide additional feedback there.

@theraysmith is doing the training at Google.  The examples you provide
will be helpful to him and improve future training.

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, Aug 29, 2017 at 7:38 PM,  wrote:

> spa and Latin in the best folder are more or less equivalent; there is no
> significant difference, and although there are several failures they are
> quite reasonable. The ones that give really bad output are the official
> ones that are installed automatically.
>
> Do you need help training the data? (Is it a neural network?) I can provide
> examples.
>
> On Tuesday, August 29, 2017, 3:17:40 (UTC+2), shree wrote:
>>
>> I had not checked the list.
>>
>> It should actually be Latin.traineddata for all languages written in
>> Latin script. Not Spanish, as I had written.
>>
>> On 29-Aug-2017 3:54 AM,  wrote:
>>
>>> So... I have installed the default tessdata used by the installer, which
>>> seems to be this one:
>>> https://github.com/tesseract-ocr/tessdata/blob/master/spa.traineddata
>>>
>>> Following your comment I have installed this file:
>>> https://github.com/tesseract-ocr/tessdata/blob/master/best/spa.traineddata
>>>
>>> But I have not found best/Spanish; is it missing from the upload?
>>>
>>> best/spa is REALLY better, and of comparable quality to English;
>>> they have more or less the same level of errors.
>>>
>>> Where is best/Spanish? Given the improvement, I am really interested in
>>> testing it.
>>>
>>> Btw, is there any way to tell tesseract that values are in a table, so
>>> that it will not make a mistake identifying lines with charts?
>>>
>>> On Monday, August 28, 2017, 8:15:41 (UTC+2), shree wrote:
>>>>
>>>> Have you tried with the 'best' traineddatas?
>>>>
>>>> What about results using best/Spanish vs best/spa?
>>>>
>>>> I have opened this as an issue at
>>>> https://github.com/tesseract-ocr/tessdata/issues/77
>>>>
>>>> You can provide additional feedback there.
>>>>
>>>> ShreeDevi
>>>>
>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>
>>>> On Mon, Aug 28, 2017 at 6:04 AM,  wrote:
>>>>
>>>>> So... after following the quality improvement instructions at
>>>>> https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality I
>>>>> got what I think is a nice picture; I attach the tessinput.tif file I
>>>>> received as output.
>>>>>
>>>>> When I ran tesseract 4.0.0 on the image I found that the eng
>>>>> version actually gives a better analysis than the
>>>>> Spanish version.
>>>>>
>>>>> What can I do? I have seen recurrent errors with the same
>>>>> chart.

Re: [tesseract-ocr] Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.

2017-08-29 Thread ShreeDevi Kumar
Also see https://github.com/tesseract-ocr/tesseract/issues/221

On 29-Aug-2017 3:26 PM, "ShreeDevi Kumar" <shreesh...@gmail.com> wrote:

> Check where osd.traineddata and eng.traineddata are installed.
> Download the other traineddata files to the same directory.
>
> On Linux, it is usually /usr/share/tessdata
>
> On 29-Aug-2017 1:58 PM, "vikram charan" <vc7...@gmail.com> wrote:
>
>> Hello,
>> I'm working on a project that scans many kinds of documents (images
>> containing text, files, inquiry forms, etc.). I'm using the
>> Tesseract library to scan these documents. I followed all the setup steps
>> described on GitHub and dragged the tessdata folder into the project. I
>> also downloaded the language traineddata from GitHub and put it in my
>> project, because my project supports 55 languages and works offline.
>>
>> Now when I run the project and scan a document, languages other than
>> English and French do not scan my documents. Sometimes Arabic also works,
>> but not every time, and when my app crashes I get this error:
>>
>> "Please make sure the TESSDATA_PREFIX environment variable is set to the
>> parent directory of your "tessdata" directory.
>> Failed loading language 'ara'
>> Tesseract couldn't load any languages!"
>>
>> When I add the trained data for all 55 languages to my project and create
>> the .ipa, its size is 205MB, which is not good for my project. To overcome
>> this problem I uploaded all the trained data to a server and download each
>> language according to the user's choice. The trained data files download
>> successfully into the Documents directory, but when I run the project it
>> gives me this error:
>>
>> Error opening data file file:///var/mobile/Containers/Data/Application/7EC1EE90-08A4-41BD-A787-4FD58E7E6575/Documents/tessdata/ara.traineddata
>> Please make sure the TESSDATA_PREFIX environment variable is set to the
>> parent directory of your "tessdata" directory.
>> Failed loading language 'ara'
>> Tesseract couldn't load any languages!
>>
>> Please help me asap so I can finish this project.
>>
>>
>> Thank you
>>


Re: [tesseract-ocr] Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.

2017-08-29 Thread ShreeDevi Kumar
Check where osd.traineddata and eng.traineddata are installed. Download
the other traineddata files to the same directory.

On Linux, it is usually /usr/share/tessdata
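
A quick way to run that check is sketched below: it lists the requested languages that have no .traineddata file in a given folder. The helper names are made up for this example. The file:// handling is only a guess prompted by the path printed in the quoted error, which starts with file://; the thread does not confirm that this is the cause.

```python
import os
from typing import List
from urllib.parse import unquote, urlparse


def to_filesystem_path(location: str) -> str:
    """Turn a file:// URL into a plain path; other strings pass through unchanged."""
    parsed = urlparse(location)
    if parsed.scheme == "file":
        return unquote(parsed.path)
    return location


def missing_traineddata(tessdata_dir: str, langs: List[str]) -> List[str]:
    """Return the language codes that have no <lang>.traineddata in tessdata_dir."""
    folder = to_filesystem_path(tessdata_dir)
    return [lang for lang in langs
            if not os.path.isfile(os.path.join(folder, lang + ".traineddata"))]


if __name__ == "__main__":
    # /usr/share/tessdata is the usual Linux location mentioned above.
    print(missing_traineddata("/usr/share/tessdata", ["osd", "eng", "ara"]))
```

An empty list means every requested language file is present in that folder; anything listed still has to be downloaded there.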

On 29-Aug-2017 1:58 PM, "vikram charan"  wrote:

> Hello,
> I'm working on a project that scans many kinds of documents (images
> containing text, files, inquiry forms, etc.). I'm using the
> Tesseract library to scan these documents. I followed all the setup steps
> described on GitHub and dragged the tessdata folder into the project. I
> also downloaded the language traineddata from GitHub and put it in my
> project, because my project supports 55 languages and works offline.
>
> Now when I run the project and scan a document, languages other than
> English and French do not scan my documents. Sometimes Arabic also works,
> but not every time, and when my app crashes I get this error:
>
> "Please make sure the TESSDATA_PREFIX environment variable is set to the
> parent directory of your "tessdata" directory.
> Failed loading language 'ara'
> Tesseract couldn't load any languages!"
>
> When I add the trained data for all 55 languages to my project and create
> the .ipa, its size is 205MB, which is not good for my project. To overcome
> this problem I uploaded all the trained data to a server and download each
> language according to the user's choice. The trained data files download
> successfully into the Documents directory, but when I run the project it
> gives me this error:
>
> Error opening data file file:///var/mobile/Containers/Data/Application/7EC1EE90-08A4-41BD-A787-4FD58E7E6575/Documents/tessdata/ara.traineddata
> Please make sure the TESSDATA_PREFIX environment variable is set to the
> parent directory of your "tessdata" directory.
> Failed loading language 'ara'
> Tesseract couldn't load any languages!
>
> Please help me asap so I can finish this project.
>
>
> Thank you
>


Re: [tesseract-ocr] tesseract is not working for straightforward image

2017-08-29 Thread ShreeDevi Kumar
Take a look at the ImproveQuality page in the wiki.

On 28-Aug-2017 6:16 PM, "Lada Tylich"  wrote:

> Hi,
> I am confused: for the attached image, with the parameter *-psm
> 7*, it gives the result *88C*. It should be able to read such a picture, I guess.
> Am I missing something?
>
> Thanks for any response.
>
> P.S.: Sorry if this is a duplicate (this is my 2nd post, because I lost
> the first one.. :/ - if you find it, this one can be deleted)
>
> Regards!
>


Re: [tesseract-ocr] Tesseract OCR 4.0.0 Alpha how to train a new font

2017-08-29 Thread ShreeDevi Kumar
Try first with

best/Latin.traineddata

that should handle text with diacritics

---

>>Pango suggested font Gandhari Unicode.

Use "Gandhari Unicode" within quotes as Font name

>>ERROR: Could not find training text file /usr/local/share/tessdata//eng/eng.training_text

Point script_dir to the langdata folder where you have your training text.
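
Both fixes can be sketched as follows. The code builds the tesstrain.sh argument list with the font name as a single element, so the space in "Gandhari Unicode" survives, and it checks for the training text before anything is launched. It uses only the flags that appear in the quoted command, and assumes that --langdata_dir is the setting meant above. The langdata path is a hypothetical checkout location, and the training text layout is inferred from the error message.

```python
import os
from typing import List


def training_text_path(langdata_dir: str, lang: str) -> str:
    """Layout inferred from the error message: <langdata_dir>/<lang>/<lang>.training_text."""
    return os.path.join(langdata_dir, lang, lang + ".training_text")


def build_tesstrain_cmd(font: str, fonts_dir: str, lang: str,
                        langdata_dir: str) -> List[str]:
    """Return an argv list for tesstrain.sh using only flags shown in the thread."""
    return [
        "./tesstrain.sh",
        "--fontlist", font,          # one argv element, so the space in the name survives
        "--fonts_dir", fonts_dir,
        "--lang", lang,
        "--langdata_dir", langdata_dir,
        "--overwrite",
    ]


# Hypothetical location of a langdata checkout; adjust to the real one.
langdata = "/home/tesseract/langdata"
cmd = build_tesstrain_cmd("Gandhari Unicode", "/home/tesseract/Downloads/fonts/",
                          "eng", langdata)
text_file = training_text_path(langdata, "eng")
if not os.path.isfile(text_file):
    print("training text not found:", text_file)
print(cmd)
```

Running the list through subprocess.run(cmd) avoids shell quoting problems altogether; on a shell command line the equivalent is --fontlist "Gandhari Unicode".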

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, Aug 29, 2017 at 11:58 AM, Anand Akella 
wrote:

> Hi,
> I'm new to tesseract and have a pdf file with diacritical marks. I tried to
> run tesseract 4.0.0 with language eng. I see that it is not able to
> recognize the text with diacritical marks. I found a font that supports
> diacritical marks:
>
> Gandhari Unicode 5.1
>
> I extracted the font files and copied them to
> /home/tesseract/Downloads/fonts
>
> Whenever I try to run tesstrain.sh it gives me the error "could not find
> font named gandhariunicode"
>
> ./tesstrain.sh --fontlist 'gandhariunicode' --fonts_dir
> /home/tesseract/Downloads/fonts/ --lang eng --langdata_dir
> /usr/local/share/tessdata/ --overwrite
>
> === Starting training for language 'eng'
> [Mon Aug 28 23:18:12 PDT 2017] /usr/local/bin/text2image
> --fonts_dir=/home/tesseract/Downloads/fonts/ --font=gandhariunicode
> --outputbase=/tmp/font_tmp.C9vSySTfge/sample_text.txt
> --text=/tmp/font_tmp.C9vSySTfge/sample_text.txt
> --fontconfig_tmpdir=/tmp/font_tmp.C9vSySTfge
> Could not find font named gandhariunicode.
> Pango suggested font Gandhari Unicode.
> Please correct --font arg.
>
> === Phase I: Generating training images ===
> ERROR: Could not find training text file /usr/local/share/tessdata//eng/eng.training_text
>
> What could the issue be? Please let me know. Thanks in advance.
>
> Thanks,
> Anand
>


Re: [tesseract-ocr] Spanish text better processed in eng than in spa

2017-08-28 Thread ShreeDevi Kumar
>Btw, is there any way to tell tesseract that values are in a table, so
that it will not make a mistake identifying lines with charts?

I don't think tesseract has that ability.

You will need to preprocess the image to remove lines. Leptonica has
functions to do that, as well as a table detector.

See https://github.com/DanBloomberg/leptonica/commits/master
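
To illustrate the preprocessing idea only, here is a pure-Python sketch that clears long horizontal runs of ink from a binary bitmap. It is not Leptonica's API. It behaves like subtracting a morphological opening with a wide horizontal structuring element, and the function name, the bitmap representation and the min_len threshold are all assumptions made for this example.

```python
from typing import List

Bitmap = List[List[int]]  # 1 = ink, 0 = background


def remove_long_horizontal_runs(image: Bitmap, min_len: int) -> Bitmap:
    """Return a copy of image with every horizontal ink run of at least min_len pixels cleared."""
    result = [row[:] for row in image]
    for y, row in enumerate(image):
        x = 0
        width = len(row)
        while x < width:
            if row[x] == 1:
                start = x
                # Walk to the end of this run of ink pixels.
                while x < width and row[x] == 1:
                    x += 1
                if x - start >= min_len:
                    for i in range(start, x):
                        result[y][i] = 0
            else:
                x += 1
    return result


page = [
    [0, 1, 1, 0, 1, 0, 0, 0],   # short strokes, e.g. parts of glyphs
    [1, 1, 1, 1, 1, 1, 1, 1],   # a ruling line spanning the full width
    [0, 1, 0, 0, 1, 1, 0, 0],
]
cleaned = remove_long_horizontal_runs(page, min_len=6)
for row in cleaned:
    print(row)
```

Vertical rules can be handled the same way on the transposed bitmap. Real scans usually need a threshold well above the widest glyph stroke, and some tolerance for skew and broken lines.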



ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, Aug 29, 2017 at 6:47 AM, ShreeDevi Kumar <shreesh...@gmail.com>
wrote:

> I had not checked the list.
>
> It should actually be Latin.traineddata for all languages written in Latin
> script. Not Spanish, as I had written.
>
> On 29-Aug-2017 3:54 AM, <valentin.depa...@gmail.com> wrote:
>
>> So... I have installed the default tessdata used by the installer, which
>> seems to be this one:
>> https://github.com/tesseract-ocr/tessdata/blob/master/spa.traineddata
>>
>> Following your comment I have installed this file:
>> https://github.com/tesseract-ocr/tessdata/blob/master/best/spa.traineddata
>>
>> But I have not found best/Spanish; is it missing from the upload?
>>
>> best/spa is REALLY better, and of comparable quality to English;
>> they have more or less the same level of errors.
>>
>> Where is best/Spanish? Given the improvement, I am really interested in
>> testing it.
>>
>> Btw, is there any way to tell tesseract that values are in a table, so
>> that it will not make a mistake identifying lines with charts?
>>
>> On Monday, August 28, 2017, 8:15:41 (UTC+2), shree wrote:
>>>
>>> Have you tried with the 'best' traineddatas?
>>>
>>> What about results using best/Spanish vs best/spa?
>>>
>>> I have opened this as an issue at
>>> https://github.com/tesseract-ocr/tessdata/issues/77
>>>
>>> You can provide additional feedback there.
>>>
>>> ShreeDevi
>>> 
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>
>>> On Mon, Aug 28, 2017 at 6:04 AM, <valentin...@gmail.com> wrote:
>>>
>>>> So... after following the quality improvement instructions at
>>>> https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality I got
>>>> what I think is a nice picture; I attach the tessinput.tif file I received
>>>> as output.
>>>>
>>>> When I ran tesseract 4.0.0 on the image I found that the eng
>>>> version actually gives a better analysis than the
>>>> Spanish version.
>>>>
>>>> What can I do? I have seen recurrent errors with the same
>>>> chart.
>>>>


Re: [tesseract-ocr] Spanish text better processed in eng than in spa

2017-08-28 Thread ShreeDevi Kumar
I had not checked the list.

It should actually be Latin.traineddata for all languages written in Latin
script. Not Spanish, as I had written.

On 29-Aug-2017 3:54 AM,  wrote:

> So... I have installed the default tessdata used by the installer, which
> seems to be this one:
> https://github.com/tesseract-ocr/tessdata/blob/master/spa.traineddata
>
> Following your comment I have installed this file:
> https://github.com/tesseract-ocr/tessdata/blob/master/best/spa.traineddata
>
> But I have not found best/Spanish; is it missing from the upload?
>
> best/spa is REALLY better, and of comparable quality to English;
> they have more or less the same level of errors.
>
> Where is best/Spanish? Given the improvement, I am really interested in
> testing it.
>
> Btw, is there any way to tell tesseract that values are in a table, so
> that it will not make a mistake identifying lines with charts?
>
> On Monday, August 28, 2017, 8:15:41 (UTC+2), shree wrote:
>>
>> Have you tried with the 'best' traineddatas?
>>
>> What about results using best/Spanish vs best/spa?
>>
>> I have opened this as an issue at
>> https://github.com/tesseract-ocr/tessdata/issues/77
>>
>> You can provide additional feedback there.
>>
>> ShreeDevi
>> 
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Mon, Aug 28, 2017 at 6:04 AM,  wrote:
>>
>>> So... after following the quality improvement instructions at
>>> https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality I got
>>> what I think is a nice picture; I attach the tessinput.tif file I received
>>> as output.
>>>
>>> When I ran tesseract 4.0.0 on the image I found that the eng
>>> version actually gives a better analysis than the
>>> Spanish version.
>>>
>>> What can I do? I have seen recurrent errors with the same chart.
>>>


Re: [tesseract-ocr] Error:Assert failed:in file ../lstm/lstmtrainer.h, line 110

2017-08-28 Thread ShreeDevi Kumar
Please see
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for-impact

The following command extracts the .lstm file from the .traineddata file.

training/combine_tessdata -e tessdata/best/eng.traineddata \
  ~/tesstutorial/impact_from_full/eng.lstm
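
A minimal shell sketch of that extraction step, wrapped in a function that checks the input first. The function name extract_lstm and its arguments are made up for illustration, and it assumes combine_tessdata is on your PATH (the command above runs it from the build tree as training/combine_tessdata).

```shell
# extract_lstm LANG TESSDATA_DIR OUT_DIR   (hypothetical helper, not part of tesseract)
# Pulls LANG.lstm out of TESSDATA_DIR/LANG.traineddata into OUT_DIR.
extract_lstm() {
    lang=$1
    tessdata=$2
    out=$3
    src="$tessdata/$lang.traineddata"
    if [ ! -f "$src" ]; then
        echo "missing $src (download it from the tessdata repository first)" >&2
        return 1
    fi
    mkdir -p "$out" || return 1
    # -e extracts the component named by the output file's extension (.lstm here).
    combine_tessdata -e "$src" "$out/$lang.lstm" || return 1
    # Succeed only if the extracted model exists and is non-empty.
    [ -s "$out/$lang.lstm" ]
}
```

For the tutorial layout above this would be called as: extract_lstm eng tessdata/best ~/tesstutorial/impact_from_full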


ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Mon, Aug 28, 2017 at 3:01 PM, Ava Nimaee  wrote:

> Hi shree
> I read the instructions on the training wiki page, but I don't have eng.lstm.
> None of the commands create eng.lstm. How can I create it? I even checked
> the langdata which I downloaded from git, and it isn't there.
> I have spent a lot of time but I don't know how I can create it.
> Can you tell me?
>
> On Monday, August 21, 2017 at 7:41:41 PM UTC+4:30, shree wrote:
>>
>> lstm file is the language model. It is saved in traineddata file.
>>
>> dawgs are a kind of compressed files, created from lists of words,
>> punctuation or numbers.
>>
>> You can use dawg2wordlist to unpack them.
>>
>> Please follow the instructions on the training wiki page.
>>


Re: [tesseract-ocr] Calling Resource sha1 is disabled! Use Resource sha256 instead Error while installing tesseract in mac

2017-08-28 Thread ShreeDevi Kumar
Try

$ brew update
$ brew install tesseract --HEAD


ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Mon, Aug 28, 2017 at 12:33 PM, Mahesh Mesta 
wrote:

>  Hello,
>
> I am trying to install tesseract on a Mac to work with the ruby gem
> tesseract-ocr. However, it seems that the newest version of tesseract
> is not supported by the gem. Hence I tried the following:
>
> brew install
> https://raw.githubusercontent.com/Homebrew/homebrew/8ba134eda537d2cee7daa7ebdd9f728389d9c53e/Library/Formula/tesseract.rb
>
> it throws the following error:
>
> Error: Calling Resource#sha1 is disabled! Use Resource#sha256 instead.
> /Users/maheshmesta/Library/Caches/Homebrew/Formula/tesseract.rb:123:in
> `block (2 levels) in '
>
> How can I rectify this issue?
>
> Regards,
>
> Mahesh
>


Re: [tesseract-ocr] Spanish text better processed in eng than in spa

2017-08-28 Thread ShreeDevi Kumar
Have you tried with the 'best' traineddatas?

What about results using best/Spanish vs best/spa?

I have opened this as an issue at
https://github.com/tesseract-ocr/tessdata/issues/77

You can provide additional feedback there.

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Mon, Aug 28, 2017 at 6:04 AM,  wrote:

> So... after following the quality improvement instructions at
> https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality I got
> what I think is a nice picture. I attach the tessinput.tif file I received
> as output.
>
> When I ran tesseract 4.0.0 on the image, I found that the eng
> version actually gives a better analysis than the
> spanish version.
>
> What can I do? I have seen recurrent errors with the same chart.
>


Re: [tesseract-ocr] error while loading shared libraries: libtesseract.so.4: cannot open shared object file: No such file or directory

2017-08-27 Thread ShreeDevi Kumar
Did you run the following?

sudo ldconfig

Try running tesseract after that.

On 27-Aug-2017 7:53 PM, "Dan9er"  wrote:

> PATH=/home/dan9er/bin:/home/dan9er/.local/bin:/usr/local/
> sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/
> games:/usr/local/games:/snap/bin:/usr/lib/jvm/java-8-
> oracle/bin:/usr/lib/jvm/java-8-oracle/db/bin:/usr/lib/jvm/
> java-8-oracle/jre/bin
>
> Ok, now how do I add libtesseract.so?
>
> On Sunday, August 27, 2017 at 10:15:20 AM UTC-4, shree wrote:
>>
>> Try
>>
>>
>> sudo ldconfig
>>
>>
>> --
>>
>> type
>>
>> env
>>
>> to see your environment variables, including PATH
>>
>>


Re: [tesseract-ocr] error while loading shared libraries: libtesseract.so.4: cannot open shared object file: No such file or directory

2017-08-27 Thread ShreeDevi Kumar
Try


sudo ldconfig


--

type

env

to see your environment variables, including PATH



Re: [tesseract-ocr] error while loading shared libraries: libtesseract.so.4: cannot open shared object file: No such file or directory

2017-08-27 Thread ShreeDevi Kumar
Do a search on libtesseract.so in your console.txt.

See if the path where it has been installed is available when you run
tesseract. Otherwise add it to your PATH environment variable.
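
A small sketch for checking where the library search goes wrong. On Linux, shared libraries are normally found through the dynamic linker cache (refreshed by ldconfig) or through LD_LIBRARY_PATH, so those are worth checking as well as PATH. The helper name dir_in_list and the /usr/local/lib location are assumptions for illustration; use whatever directory your console log shows.

```shell
# dir_in_list DIR LIST   (hypothetical helper)
# Succeeds if DIR appears in the colon-separated LIST, e.g. $LD_LIBRARY_PATH.
dir_in_list() {
    case ":$2:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Checks to run by hand (kept as comments so the sketch has no side effects):
#   ldconfig -p | grep libtesseract     # is the library in the linker cache?
#   ldd "$(command -v tesseract)"       # which libraries fail to resolve?
#   sudo ldconfig                       # rebuild the cache after installing
#
# If the directory is still not searched (assumed location /usr/local/lib):
#   dir_in_list /usr/local/lib "$LD_LIBRARY_PATH" ||
#       export LD_LIBRARY_PATH="/usr/local/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```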

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sun, Aug 27, 2017 at 12:54 AM, Dan9er  wrote:

> I'm trying to install Tesseract with training tools, and I'm getting this
> error. I pasted my console log here (too big for pastebin):
> https://drive.google.com/open?id=0B0YopyBgBXqNd05zWFk2NFFUUjg
>


Re: [tesseract-ocr] Re: Does unicharset affect recognition quality ?

2017-08-25 Thread ShreeDevi Kumar
I do not know the internal workings of tesseract.

If you unpack best/kan.traineddata you may find a smaller unicharset
with just the basic characters in it.

Tesseract 4 uses the LSTM neural net engine, whereas 3.05 uses the legacy
engine. LSTM does line-based recognition rather than character-based.
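
To compare unicharset sizes yourself, something like the following may help. It relies on the first line of a unicharset file holding the number of entries. The helper name unicharset_size is made up, and the unpacked component name kan.lstm-unicharset is an assumption that may differ between versions.

```shell
# unicharset_size FILE   (hypothetical helper)
# Prints the entry count declared on the first line of a unicharset file.
unicharset_size() {
    [ -r "$1" ] || { echo "cannot read $1" >&2; return 1; }
    head -n 1 "$1"
}

# Typical use (assumes combine_tessdata is on PATH):
#   combine_tessdata -u best/kan.traineddata /tmp/kan.
#   unicharset_size /tmp/kan.lstm-unicharset
```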

Yes, it is possible to have both versions installed, however I do not have
exact instructions to make it work. It would also depend on what o/s you
are using.

I only have the latest GitHub version installed.

On 25-Aug-2017 9:46 PM, "Yury" <yura...@gmail.com> wrote:

> ShreeDevi,
>
> Thanks for your answers and for taking the time.
>
> I got the traineddata file for version 3.04 (the file is a little smaller,
> but the number of characters is the same: 2851) and I get the same result:
> some symbols are split into a pair (the first is correct and the other is wrong).
> I am thinking of upgrading to 4.00, so I have some questions:
>
> Can I install the new version alongside 3.05, without uninstalling?
>
> And another question from my first post:
> Does tesseract have a limit on the number of bytes per character
> in Unicode?
> Are there any additional parameters to remove the limit on the number of
> bytes per symbol?
>
> On Friday, August 25, 2017 at 20:13:22 UTC+7, shree wrote:
>>
>> If you are using the 4.0alpha - latest version of program you can use
>> kannada traineddata from
>>
>> https://github.com/tesseract-ocr/tessdata/blob/master/best/kan.traineddata
>> or
>> https://github.com/tesseract-ocr/tessdata/blob/master/best/Kannada.traineddata
>>
>> I have not tested kannada personally but if it follows the pattern for
>> devanagari, it should be better than the older traineddata.
>>
>> If you are using 3.05 version of program,
>> then use traineddata files from
>> https://github.com/tesseract-ocr/tessdata/releases/tag/3.04.00
>>
>> Please note that the unicharset and langdata files are used while
>> training and just changing the unicharset file is NOT going to improve the
>> recognition.
>>
>> For that training needs to be done. Please see the wiki for more details.
>>
>> ShreeDevi
>> 
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Fri, Aug 25, 2017 at 6:31 PM, Yury <yur...@gmail.com> wrote:
>>
>>> Hello shree!
>>>
>>> Thanks for your links and for taking the time.
>>>
>>> I did not find a /best/ folder in the ~alex-p profile.
>>> But I found kan.traineddata in the package tesseract-lang-4.00 (in
>>> tesseract-lang-3.05 the Kannada language is absent).
>>> I got this file and ran recognition: the result is the same.
>>> This package is dated 08.01.17 and has 2851 characters (the same as mine).
>>> So I think I was already using this package earlier.
>>>
>>> On Friday, August 25, 2017 at 18:56:25 UTC+7, shree wrote:
>>>>
>>>> https://launchpad.net/~alex-p/+archive/ubuntu/tesseract-ocr
>>>>
>>>> For ppa
>>>>
>>>> On 25-Aug-2017 5:22 PM, "ShreeDevi Kumar" <shree...@gmail.com> wrote:
>>>>
>>>>> Latest GitHub source in master branch is for 4.0alpha. You can install
>>>>> via PPA.
>>>>>
>>>>> Search for tesseract PPA Alex in Google.
>>>>>
>>>>> _sent from phone
>>>>>
>>>>> On 25-Aug-2017 4:42 PM, "Yury" <yur...@gmail.com> wrote:
>>>>>
>>>>>> Hello again.
>>>>>>
>>>>>> I found this:
>>>>>> https://github.com/tesseract-ocr/tessdata/blob/master/best/Kannada.traineddata
>>>>>>
>>>>>> But after recognition I see only English text symbols and digits, so
>>>>>> this did not work.
>>>>>> In the log I see:
>>>>>>  theraysmith <https://github.com/theraysmith> Added best
>>>>>> traineddatas for 4.00 alpha
>>>>>> <https://github.com/tesseract-ocr/tessdata/commit/3a94ddd47be01fd897cbe31f05cbd2301454cf8a>
>>>>>>
>>>>>> I have 3.05.
>>>>>>
>>>>>>
>>>>>> On Friday, August 25, 2017 at 17:47:56 UTC+7, Yury wrote:
>>>>>>>
>>>>>>> Hello, shree!
>>>>>>>
>>>>>>> Can you tell me the exact path for tessdata/best/*.traineddata?
>>>>>>>
>>>>>>> Friday

Re: [tesseract-ocr] Re: Does unicharset affect recognition quality ?

2017-08-25 Thread ShreeDevi Kumar
If you are using 4.0alpha, the latest version of the program, you can use
the Kannada traineddata from

https://github.com/tesseract-ocr/tessdata/blob/master/best/kan.traineddata
or
https://github.com/tesseract-ocr/tessdata/blob/master/best/Kannada.traineddata

I have not tested Kannada personally, but if it follows the pattern for
Devanagari, it should be better than the older traineddata.

If you are using the 3.05 version of the program,
then use the traineddata files from
https://github.com/tesseract-ocr/tessdata/releases/tag/3.04.00
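
The version rule above can be written down as a small lookup. This is only a sketch: the function name traineddata_url is made up, and the two URLs are the ones given in this message.

```shell
# traineddata_url VERSION LANG   (hypothetical helper)
# Prints where to look for LANG's traineddata for a given tesseract version.
traineddata_url() {
    case "$1" in
        4.*) echo "https://github.com/tesseract-ocr/tessdata/blob/master/best/$2.traineddata" ;;
        3.*) echo "https://github.com/tesseract-ocr/tessdata/releases/tag/3.04.00" ;;
        *)   echo "unknown tesseract version: $1" >&2; return 1 ;;
    esac
}

# Example with the installed version (assumes the first line of the version
# output looks like "tesseract 4.00.00alpha"):
#   traineddata_url "$(tesseract --version 2>&1 | awk 'NR==1 {print $2}')" kan
```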

Please note that the unicharset and langdata files are used while training
and just changing the unicharset file is NOT going to improve the
recognition.

For that, training needs to be done. Please see the wiki for more details.

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Fri, Aug 25, 2017 at 6:31 PM, Yury <yura...@gmail.com> wrote:

> Hello shree!
>
> Thanks for your links and for taking the time.
>
> I did not find a /best/ folder in the ~alex-p profile.
> But I found kan.traineddata in the package tesseract-lang-4.00 (in
> tesseract-lang-3.05 the Kannada language is absent).
> I got this file and ran recognition: the result is the same.
> This package is dated 08.01.17 and has 2851 characters (the same as mine).
> So I think I was already using this package earlier.
>
> On Friday, August 25, 2017 at 18:56:25 UTC+7, shree wrote:
>>
>> https://launchpad.net/~alex-p/+archive/ubuntu/tesseract-ocr
>>
>> For ppa
>>
>> On 25-Aug-2017 5:22 PM, "ShreeDevi Kumar" <shree...@gmail.com> wrote:
>>
>>> Latest GitHub source in master branch is for 4.0alpha. You can install
>>> via PPA.
>>>
>>> Search for tesseract PPA Alex in Google.
>>>
>>> _sent from phone
>>>
>>> On 25-Aug-2017 4:42 PM, "Yury" <yur...@gmail.com> wrote:
>>>
>>>> Hello again.
>>>>
>>>> I found this:
>>>> https://github.com/tesseract-ocr/tessdata/blob/master/best/Kannada.traineddata
>>>>
>>>> But after recognition I see only English text symbols and digits, so
>>>> this did not work.
>>>> In the log I see:
>>>>  theraysmith <https://github.com/theraysmith> Added best traineddatas
>>>> for 4.00 alpha
>>>> <https://github.com/tesseract-ocr/tessdata/commit/3a94ddd47be01fd897cbe31f05cbd2301454cf8a>
>>>>
>>>> I have 3.05.
>>>>
>>>>
>>>> On Friday, August 25, 2017 at 17:47:56 UTC+7, Yury wrote:
>>>>>
>>>>> Hello, shree!
>>>>>
>>>>> Can you tell me the exact path for tessdata/best/*.traineddata?
>>>>>
>>>>> On Friday, August 25, 2017 at 16:07:49 UTC+7, shree wrote:
>>>>>>
>>>>>> Have you tried the new tessdata/best/*.traineddata with the latest
>>>>>> github sources?
>>>>>>


Re: [tesseract-ocr] Dropped single character words

2017-08-25 Thread ShreeDevi Kumar
https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality

Rescaling to 300 dpi is also helpful.
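
As a rough sketch of the rescaling step: the helper below works out the resize percentage needed to bring a scan to 300 dpi. The name scale_percent is made up, and the ImageMagick call in the comment assumes ImageMagick is installed; any image tool that can resample would do.

```shell
# scale_percent CURRENT_DPI [TARGET_DPI]   (hypothetical helper)
# Prints the rounded resize percentage needed to reach TARGET_DPI (default 300).
scale_percent() {
    cur=$1
    target=${2:-300}
    # Reject missing, non-numeric, or non-positive input.
    [ "$cur" -gt 0 ] 2>/dev/null || { echo "bad DPI: $cur" >&2; return 1; }
    echo $(( (target * 100 + cur / 2) / cur ))
}

# Example for a 150 dpi scan (assumes ImageMagick's convert is available):
#   convert scan.tif -resize "$(scale_percent 150)%" \
#       -units PixelsPerInch -density 300 scan-300.tif
```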

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Fri, Aug 25, 2017 at 5:44 PM, Clinton Graham <ctgra...@pitt.edu> wrote:

> Thanks for the suggestion.  The 4.0 alpha does seem to be providing better
> results out of the box.  I pulled the Windows installer:
> tesseract 4.00.00alpha
>  leptonica-1.74.1
>   libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.5.0) : libpng 1.6.20 :
> libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.3 : libopenjp2 2.1.0
>
> Enjoy,
>
>
> - Clinton Graham
> Systems Developer
> University of Pittsburgh | University Library System
> 412-383-1057
>
> On Friday, August 25, 2017 at 7:54:25 AM UTC-4, shree wrote:
>>
>> https://launchpad.net/~alex-p/+archive/ubuntu/tesseract-ocr
>>
>> For the ppa
>>
>> On 25-Aug-2017 12:45 AM, "ShreeDevi Kumar" <shree...@gmail.com> wrote:
>>
>>> There is an unofficial ppa package available with latest code, if you do
>>> not want to build it.
>>>
>>> -- Excuse the brevity, msg sent from phone.
>>>
>>> On 25-Aug-2017 12:41 AM, "ShreeDevi Kumar" <shree...@gmail.com> wrote:
>>>
>>>> You can try building latest GitHub source for 4.0alpha and test with
>>>> the best/eng.traineddata from the tessdata repository.
>>>>
>>>> -- Excuse the brevity, msg sent from phone.
>>>>
>>>> On 25-Aug-2017 12:36 AM, "Clinton Graham" <ctgr...@pitt.edu> wrote:
>>>>
>>>>> Do you have any simple suggestions for improving OCR quality where
>>>>> tesseract is missing single character words like "a" and "I"?
>>>>>
>>>>> I'm using the default packages available in Ubuntu:
>>>>> tesseract 3.03
>>>>>  leptonica-1.70
>>>>>   libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib
>>>>> 1.2.8 : webp 0.4.0
>>>>>
>>>>> I've also tried updating Ubuntu, building later 3.x sources:
>>>>> tesseract 3.05.01
>>>>>  leptonica-1.74.4
>>>>>   libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 : libtiff 4.0.3 :
>>>>> zlib 1.2.8
>>>>>
>>>>> I'm using a command line run of simply:
>>>>> tesseract -psm 1 -l eng $f $f pdf
>>>>>
>>>>> I've also tried -psm 6 based on another forum post (though some of my
>>>>> input will be multicolumn).
>>>>>
>>>>> In either case, the first paragraph of my TIFF (attached) is
>>>>> consistently read without instances of single character words:
>>>>>
>>>>> Honors Award {Presentation to Robert H. Ivy, M.D., D.D.S., Sc.D.,
>>>>>> F_‘.A.C.S. At the business meeting .of the American Cleft Palate
>>>>>> Association on May 6, 1961 in Montreal, Canada, an Honors and Awards
>>>>>> Committee was established and its duties were set forth. The Executive
>>>>>> Committee then selected Dr. Robert Ivy to be the first recipient of an
>>>>>> Honors Award. An HOnors and Awards Committee was then selected by the
>>>>>> President; serve as the current chairman. It therefore becomes personal
>>>>>> honor and privilege to me to be able to present this first award to good
>>>>>> friend. Dr. Ivy has had long and brilliant career in the field of plastic
>>>>>> surgery with particular interest in the cleft lip and palate patient. It
>>>>>> will be possible for us to mention only very few of Dr. Ivy’s many
>>>>>> accomplishments in our allotted time here today. would, therefore, like to
>>>>>> recommend to you two publications which will give you more insight into the
>>>>>> life of our honored guest.
>>>>>>
>>>>>
>>>>> I'm hoping this sample and description is also representative of other
>>>>> dropped characters, such as single numerals in pagination and single
>>>>> initials in some instances.
>>>>>
>>>>> Unfortunately, I don't have a lot of time to devote to this project,
>>>>> so anything easy and obvious which I'm missing?
>>>>>
>>>>> Thanks,
>>>>>
>>>

Re: [tesseract-ocr] Re: Does unicharset affect recognition quality ?

2017-08-25 Thread ShreeDevi Kumar
https://launchpad.net/~alex-p/+archive/ubuntu/tesseract-ocr

For ppa
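
On Ubuntu, adding that PPA usually comes down to a few commands. The sketch below derives the ppa: name from the Launchpad URL; the helper name ppa_from_url is made up, and the package name tesseract-ocr in the comment is an assumption based on the standard Ubuntu package.

```shell
# ppa_from_url URL   (hypothetical helper)
# Converts a Launchpad archive URL into the ppa:OWNER/NAME form used by
# add-apt-repository. Prints nothing and fails for any other URL.
ppa_from_url() {
    spec=$(printf '%s\n' "$1" |
        sed -n 's|^https://launchpad\.net/~\([^/]*\)/+archive/ubuntu/\([^/]*\)/*$|ppa:\1/\2|p')
    [ -n "$spec" ] && echo "$spec"
}

# Typical use:
#   sudo add-apt-repository "$(ppa_from_url https://launchpad.net/~alex-p/+archive/ubuntu/tesseract-ocr)"
#   sudo apt-get update
#   sudo apt-get install tesseract-ocr
```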

On 25-Aug-2017 5:22 PM, "ShreeDevi Kumar" <shreesh...@gmail.com> wrote:

> Latest GitHub source in master branch is for 4.0alpha. You can install via
> PPA.
>
> Search for tesseract PPA Alex in Google.
>
> _sent from phone
>
> On 25-Aug-2017 4:42 PM, "Yury" <yura...@gmail.com> wrote:
>
>> Hello again.
>>
>> I found this:
>> https://github.com/tesseract-ocr/tessdata/blob/master/best/Kannada.traineddata
>>
>> But after recognition I see only English text symbols and digits, so this
>> did not work.
>> In the log I see:
>>  theraysmith <https://github.com/theraysmith> Added best traineddatas
>> for 4.00 alpha
>> <https://github.com/tesseract-ocr/tessdata/commit/3a94ddd47be01fd897cbe31f05cbd2301454cf8a>
>>
>> I have 3.05.
>>
>>
>> On Friday, August 25, 2017 at 17:47:56 UTC+7, Yury wrote:
>>>
>>> Hello, shree!
>>>
>>> Can you tell me the exact path for tessdata/best/*.traineddata?
>>>
>>> On Friday, August 25, 2017 at 16:07:49 UTC+7, shree wrote:
>>>>
>>>> Have you tried the new tessdata/best/*.traineddata with the latest
>>>> github sources?
>>>>


Re: [tesseract-ocr] Dropped single character words

2017-08-25 Thread ShreeDevi Kumar
https://launchpad.net/~alex-p/+archive/ubuntu/tesseract-ocr

For the ppa
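
Once a 4.0 alpha build is installed, the run from the original question could look roughly like this. A sketch only: the function name ocr_to_pdf is made up, it uses the --psm spelling of the page segmentation option, and it assumes best/eng.traineddata has been copied into your tessdata directory as eng.traineddata.

```shell
# ocr_to_pdf IMAGE   (hypothetical helper)
# Runs tesseract with automatic page segmentation plus OSD (psm 1) and
# writes a searchable PDF next to the image.
ocr_to_pdf() {
    f=$1
    [ -f "$f" ] || { echo "no such image: $f" >&2; return 1; }
    # The output base drops the image extension, so scan.tif becomes scan.pdf.
    tesseract "$f" "${f%.*}" -l eng --psm 1 pdf
}

# Example over a directory of scans:
#   for f in *.tif; do ocr_to_pdf "$f"; done
```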

On 25-Aug-2017 12:45 AM, "ShreeDevi Kumar" <shreesh...@gmail.com> wrote:

> There is an unofficial ppa package available with latest code, if you do
> not want to build it.
>
> -- Excuse the brevity, msg sent from phone.
>
> On 25-Aug-2017 12:41 AM, "ShreeDevi Kumar" <shreesh...@gmail.com> wrote:
>
>> You can try building latest GitHub source for 4.0alpha and test with the
>> best/eng.traineddata from the tessdata repository.
>>
>> -- Excuse the brevity, msg sent from phone.
>>
>> On 25-Aug-2017 12:36 AM, "Clinton Graham" <ctgra...@pitt.edu> wrote:
>>
>>> Do you have any simple suggestions for improving OCR quality where
>>> tesseract is missing single character words like "a" and "I"?
>>>
>>> I'm using the default packages available in Ubuntu:
>>> tesseract 3.03
>>>  leptonica-1.70
>>>   libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib
>>> 1.2.8 : webp 0.4.0
>>>
>>> I've also tried updating Ubuntu, building later 3.x sources:
>>> tesseract 3.05.01
>>>  leptonica-1.74.4
>>>   libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 : libtiff 4.0.3 :
>>> zlib 1.2.8
>>>
>>> I'm using a command line run of simply:
>>> tesseract -psm 1 -l eng $f $f pdf
>>>
>>> I've also tried -psm 6 based on another forum post (though some of my
>>> input will be multicolumn).
>>>
>>> In either case, the first paragraph of my TIFF (attached) is
>>> consistently read without instances of single character words:
>>>
>>> Honors Award {Presentation to Robert H. Ivy, M.D., D.D.S., Sc.D.,
>>>> F_‘.A.C.S. At the business meeting .of the American Cleft Palate
>>>> Association on May 6, 1961 in Montreal, Canada, an Honors and Awards
>>>> Committee was established and its duties were set forth. The Executive
>>>> Committee then selected Dr. Robert Ivy to be the first recipient of an
>>>> Honors Award. An HOnors and Awards Committee was then selected by the
>>>> President; serve as the current chairman. It therefore becomes personal
>>>> honor and privilege to me to be able to present this first award to good
>>>> friend. Dr. Ivy has had long and brilliant career in the field of plastic
>>>> surgery with particular interest in the cleft lip and palate patient. It
>>>> will be possible for us to mention only very few of Dr. Ivy’s many
>>>> accomplishments in our allotted time here today. would, therefore, like to
>>>> recommend to you two publications which will give you more insight into the
>>>> life of our honored guest.
>>>>
>>>
>>> I'm hoping this sample and description is also representative of other
>>> dropped characters, such as single numerals in pagination and single
>>> initials in some instances.
>>>
>>> Unfortunately, I don't have a lot of time to devote to this project, so
>>> anything easy and obvious which I'm missing?
>>>
>>> Thanks,
>>>
>>> - Clinton Graham
>>>
>>> Systems Developer
>>>
>>> University of Pittsburgh | University Library System
>>>
>>> 412-383-1057
>>>
>>>


Re: [tesseract-ocr] Re: Does unicharset affect recognition quality ?

2017-08-25 Thread ShreeDevi Kumar
The latest GitHub source in the master branch is for 4.0alpha. You can
install it via a PPA.

Search Google for "tesseract PPA Alex".

_sent from phone

On 25-Aug-2017 4:42 PM, "Yury"  wrote:

> Hello again.
>
> I found this:
> https://github.com/tesseract-ocr/tessdata/blob/master/best/Kannada.traineddata
>
> But after recognition I see only English letters and digits, so this
> did not work.
> In the log I see:
>  theraysmith  Added best traineddatas for 4.00 alpha
>
>
> I have 3.05.
>
>
> On Friday, 25 August 2017 at 17:47:56 UTC+7, Yury wrote:
>>
>> Hello, shree!
>>
>> Can you tell me exact path for tessdata/best/*.traineddata ?
>>
>> On Friday, 25 August 2017 at 16:07:49 UTC+7, shree wrote:
>>>
>>> Have you tried the new tessdata/best/*.traineddata with the latest
>>> github sources?
>>>


Re: [tesseract-ocr] Re: Does unicharset affect recognition quality ?

2017-08-25 Thread ShreeDevi Kumar
Have you tried the new tessdata/best/*.traineddata with the latest github
sources?



Re: [tesseract-ocr] Dropped single character words

2017-08-24 Thread ShreeDevi Kumar
There is an unofficial PPA package available with the latest code, if you do
not want to build it yourself.

-- Excuse the brevity, msg sent from phone.




Re: [tesseract-ocr] Dropped single character words

2017-08-24 Thread ShreeDevi Kumar
You can try building the latest GitHub source for 4.0alpha and testing with
the best/eng.traineddata from the tessdata repository.
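
As a rough sketch of such a test run, the snippet below only assembles the command line and does not run Tesseract. It assumes a 4.0alpha build, where the page segmentation option is spelled --psm (the 3.x runs quoted below use -psm), and it assumes best/eng.traineddata was saved as eng.traineddata in a local directory passed with --tessdata-dir. The function name and the paths are made up for illustration.

```python
from typing import List


def build_tesseract_cmd(image: str, out_base: str, tessdata_dir: str,
                        lang: str = "eng", psm: int = 1) -> List[str]:
    """Assemble (but do not run) a Tesseract 4.0alpha command line."""
    return [
        "tesseract", image, out_base,
        "--tessdata-dir", tessdata_dir,  # directory holding <lang>.traineddata
        "-l", lang,
        "--oem", "1",                    # 1 = LSTM engine only
        "--psm", str(psm),               # 1 = automatic page segmentation with OSD
        "pdf",                           # config name: write a searchable PDF
    ]


if __name__ == "__main__":
    # Hypothetical paths; replace with the real image and tessdata directory.
    cmd = build_tesseract_cmd("page.tif", "page", "/path/to/tessdata_best")
    print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run once the paths point at real files. Keeping --psm as a parameter makes it easy to compare mode 1 with mode 6 on the same page.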

-- Excuse the brevity, msg sent from phone.

On 25-Aug-2017 12:36 AM, "Clinton Graham"  wrote:

> Do you have any simple suggestions for improving OCR quality where
> tesseract is missing single character words like "a" and "I"?
>
> I'm using the default packages available in Ubuntu:
> tesseract 3.03
>  leptonica-1.70
>   libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib
> 1.2.8 : webp 0.4.0
>
> I've also tried updating Ubuntu, building later 3.x sources:
> tesseract 3.05.01
>  leptonica-1.74.4
>   libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 : libtiff 4.0.3 : zlib
> 1.2.8
>
> I'm using a command line run of simply:
> tesseract -psm 1 -l eng $f $f pdf
>
> I've also tried -psm 6 based on another forum post (though some of my
> input will be multicolumn).
>
> In every case, the first paragraph of my TIFF (attached) is
> consistently read without its single-character words:
>
> Honors Award {Presentation to Robert H. Ivy, M.D., D.D.S., Sc.D.,
>> F_‘.A.C.S. At the business meeting .of the American Cleft Palate
>> Association on May 6, 1961 in Montreal, Canada, an Honors and Awards
>> Committee was established and its duties were set forth. The Executive
>> Committee then selected Dr. Robert Ivy to be the first recipient of an
>> Honors Award. An HOnors and Awards Committee was then selected by the
>> President; serve as the current chairman. It therefore becomes personal
>> honor and privilege to me to be able to present this first award to good
>> friend. Dr. Ivy has had long and brilliant career in the field of plastic
>> surgery with particular interest in the cleft lip and palate patient. It
>> will be possible for us to mention only very few of Dr. Ivy’s many
>> accomplishments in our allotted time here today. would, therefore, like to
>> recommend to you two publications which will give you more insight into the
>> life of our honored guest.
>>
>
> I'm hoping this sample and description is also representative of other
> dropped characters, such as single numerals in pagination and single
> initials in some instances.
>
> Unfortunately, I don't have a lot of time to devote to this project, so
> anything easy and obvious which I'm missing?
>
> Thanks,
>
> - Clinton Graham
>
> Systems Developer
>
> University of Pittsburgh | University Library System
>
> 412-383-1057
>
>


Re: [tesseract-ocr] Error in Layout Analysis with Tesseract OCR 4.0.0alpha

2017-08-23 Thread ShreeDevi Kumar
Skipping words is a known issue in Tesseract. amitdo has a proposed patch
for it. Look in the Tesseract issue tracker.

You can see if it helps in your case.

-- Excuse the brevity, msg sent from phone.

On 23-Aug-2017 9:16 PM, "Nirajan Pant"  wrote:

> Yes, I have tried both gImageReader and VietOCR as GUI front ends for
> Tesseract with Nepali. The results from both GUIs skip words.


Re: [tesseract-ocr] Error in Layout Analysis with Tesseract OCR 4.0.0alpha

2017-08-23 Thread ShreeDevi Kumar
You could try doing your own layout analysis instead of relying on
Tesseract's auto mode.

Have you tried gImageReader and VietOCR as GUI front ends for Tesseract
with Nepali?



ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Wed, Aug 23, 2017 at 10:03 AM, Nirajan Pant  wrote:

> I am working on a GUI for Tesseract OCR 4.0.0 (Nepali language). When I
> started analysing the recognition results, I found some missing words or
> sentences. To find the reason, I drew the boxes detected by Tesseract
> (taken from the hOCR output). The detection is shown here:
>
> [image: part of the document with the detected paragraph boxes]
>
> This is a part of the document with a paragraph detection error. The red
> line is the boundary of the detected paragraph (second column of the
> original image given below).
>
> The original image is:
>
> [image: original page]
>
> Help me to deal with this issue.
>


Re: [tesseract-ocr] The net_spec in the chi_sim.traineddata

2017-08-23 Thread ShreeDevi Kumar
Loaded file ./tess4training-save/tess4training-vedic/tessdata/best/Devanagari.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Code range changed from 217 to 157!!
Num (Extended) outputs,weights in Series:
1,48,0,1:1, 0
Num (Extended) outputs,weights in Series:
C3,3:9, 0
Ft16:16, 160
Total weights = 160
[C3,3Ft16]:16, 160
Mp3,3:16, 0
Lfys64:64, 20736
Lfx64:64, 33024
Lrx64:64, 33024
Lfx512:512, 1181696
Fc157:157, 80541
Total weights = 1349181
Previous null char=2 mapped to 2
Continuing from ./tess4training-save/tess4training-vedic/tessdata/best/Devanagari.lstm


See the line:

Code range changed from 217 to 157!!

The new value, 157, is the size of the unicharset. That is the number used
in the Fc layer (Fc157 in the log).
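
The per-layer weight counts in the log can be reproduced by hand, which also shows where the unicharset size enters. The sketch below assumes that a fully connected layer has outputs x (inputs + 1) weights and that an LSTM layer has 4 x outputs x (inputs + outputs + 1) weights (four gates, each with a bias). These formulas are inferred from the numbers in the log, not taken from the Tesseract source.

```python
def fc_weights(n_in: int, n_out: int) -> int:
    """Fully connected layer: one weight per input plus a bias, per output."""
    return n_out * (n_in + 1)


def lstm_weights(n_in: int, n_out: int) -> int:
    """LSTM layer: four gates, each fed by inputs, recurrent outputs and a bias."""
    return 4 * n_out * (n_in + n_out + 1)


# (layer name, weight count), in the order the layers appear in the log.
layers = [
    ("Ft16",   fc_weights(3 * 3 * 1, 16)),  # 3x3 window over a 1-channel image
    ("Lfys64", lstm_weights(16, 64)),
    ("Lfx64",  lstm_weights(64, 64)),
    ("Lrx64",  lstm_weights(64, 64)),
    ("Lfx512", lstm_weights(64, 512)),
    ("Fc157",  fc_weights(512, 157)),       # 157 = unicharset size
]

for name, count in layers:
    print(f"{name}: {count}")
print("Total weights =", sum(count for _, count in layers))
```

Under these assumptions the script prints the same per-layer figures as the log and the same total, 1349181. Only the last entry depends on the unicharset: with 217 classes it would be 217 x 513 = 111321 weights instead of 80541.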



ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Wed, Aug 23, 2017 at 2:48 PM,  wrote:

> Yes, I have seen the built network at the beginning of the training
> step. Thanks for the reply.
>
> The basetrain.log file shows: Built network:[1,48,0,1 [C3,3 Ft16]
> Mp3,3 Lfys64 Lfx96 Lrx96 Lfx512 Fc209] from request [1,48,0,1 Ct3,3,16
> Mp3,3 Lfys64 Lfx96 Lrx96 Lfx512 O1c1]
>
> I have some questions about this built network:
>
> 1. The [C3,3 Ft16] layers are enclosed in brackets. Why are they
> enclosed, and what do the brackets stand for?
> 2. Fc209, the last layer of this network, is a fully connected layer.
> What does the 'c' in this layer mean? I cannot find what 'Fc'
> represents in the VGSLSpecs tutorial.
>
> Thanks.


Re: [tesseract-ocr] Re: Msg from Ray - Calling for community contribution for some languages

2017-08-23 Thread ShreeDevi Kumar
> yor.traineddata doesn't seem robust enough

I have added this as an issue; see
https://github.com/tesseract-ocr/langdata/issues/89

> My project right now needs more training data to make the model more
> robust. It is very tough to find properly marked Yoruba text on the
> internet.

You can see if the information on the following page is helpful.

http://crubadan.org/languages/yo



Re: [tesseract-ocr] The net_spec in the chi_sim.traineddata

2017-08-23 Thread ShreeDevi Kumar
I think that number is ignored and the actual number generated from the
unicharset is used.

Usually there is a message right at the beginning of training showing the
number being used.
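
To illustrate the point, here is a small sketch of that substitution. It is not Tesseract's code: the function name is made up, and the regular expression only handles the O1c<N> form seen in this thread. The example values come from another message in this thread, where a request ending in O1c1 was built as a network whose last layer was printed as Fc209.

```python
import re


def resolve_output_layer(net_spec: str, unicharset_size: int) -> str:
    """Replace a trailing O1c<N> output layer with the real class count.

    Hypothetical helper: it mimics the described behaviour that the number
    written after 'O1c' is ignored in favour of the unicharset size.
    """
    return re.sub(r"O1c\d+(?=\s*\]\s*$)", f"O1c{unicharset_size}", net_spec)


if __name__ == "__main__":
    requested = "[1,48,0,1 Ct3,3,16 Mp3,3 Lfys64 Lfx96 Lrx96 Lfx512 O1c1]"
    print(resolve_output_layer(requested, 209))
```

The rest of the spec is left untouched, so the placeholder class count in a stored net_spec says nothing about how many classes the trained model really has.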

On 23-Aug-2017 12:21 PM,  wrote:

> Hello,
>
> I extracted the components of chi_sim.traineddata with the command:
>   combine_tessdata -u ../../tessdata/chi_sim.traineddata ../../chi_sim_comp
>
> Then I looked at the network shown in the chi_sim_comp file. The
> network is [1,48,0,1Ct3,3,16 Mp3,3 Lfys64 Lfx96 Lrx96 Lfx512 O1c1]
>
> From the VGSL Specs language, I infer that the output layer of the
> network is O1c1, which means the output layer produces 1-d (sequence)
> output, trained with CTC, *outputting 1 class*.
>
>
> Why does the output layer have only one class? Is the network
> structure recorded in chi_sim.traineddata wrong?
>


Re: [tesseract-ocr] Training from scratch to re-train the chi_sim.traineddata for studying

2017-08-22 Thread ShreeDevi Kumar
The files will be at Google. You have to wait till Ray Smith updates the
repository.

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, Aug 22, 2017 at 12:58 PM,  wrote:

> Thanks for your reply.
>
> Do you know where I can find the new langdata files?


Re: [tesseract-ocr] Training from scratch to re-train the chi_sim.traineddata for studying

2017-08-22 Thread ShreeDevi Kumar
The langdata files have not been updated for 4.00alpha
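
One way to see which characters account for a gap like 4384 versus 2538 is to compare the unicharsets of the two models. The sketch below assumes the usual text layout of a unicharset file (the first line is the entry count, and each following line starts with the character, then a space, then its properties) and that both files were extracted beforehand, for example with combine_tessdata -u. The sample contents are invented and the property columns are simplified.

```python
from typing import Set


def unichars(unicharset_text: str) -> Set[str]:
    """Return the set of characters listed in the text of a unicharset file."""
    lines = unicharset_text.splitlines()
    declared = int(lines[0])  # first line: number of entries
    chars = {line.split(" ", 1)[0] for line in lines[1:] if line}
    if len(chars) != declared:
        print(f"warning: header says {declared}, found {len(chars)}")
    return chars


# Tiny made-up samples standing in for the two extracted unicharset files.
official = "4\nNULL 0\n中 5\n文 5\n字 5\n"
retrained = "3\nNULL 0\n中 5\n文 5\n"

missing = unichars(official) - unichars(retrained)
print(sorted(missing))  # characters present only in the official model
```

With real files, read each one with open(path, encoding="utf-8").read() and pass the text to unichars. Characters that appear only in the official set were most likely absent from the training text used for the retrained model.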

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, Aug 22, 2017 at 12:17 PM,  wrote:

> Hello,
>
> I'm trying to re-train the chi_sim.traineddata model from scratch, for
> study purposes.
>
> I use chi_sim.training_text from the directory
> https://github.com/tesseract-ocr/langdata/tree/master/chi_sim as source
> data and train the model with the command:
>
> training/lstmtraining --debug_interval 100 \
> --traineddata ~/tesstutorial/trainspecial/chi_sim/chi_sim.traineddata \
> --net_spec '[1,48,0,1 Ct3,3,16 Mp3,3 Lfys64 Lfx96 Lrx96 Lfx512 O1c1]' \
> --model_output ~/tesstutorial/specialoutput/base --learning_rate 20e-4 \
> --train_listfile ~/tesstutorial/trainspecial/chi_sim.training_files.txt \
> --eval_listfile ~/tesstutorial/evalspecial/chi_sim.training_files.txt \
> --max_iterations 3600 &>~/tesstutorial/specialoutput/basetrain.log
>
>
>
> The net_spec is the same as in the official model (chi_sim.traineddata
> from the tessdata GitHub repository).
>
>
>
> After converting the checkpoint with lstmtraining --stop_training, a new
> traineddata model was generated, which I named chi_sim_new.traineddata.
> The official model keeps the name chi_sim.traineddata to distinguish the
> two.
>
>
> Then I extracted all the characters from the two traineddata models.
>
> There are 4384 characters in chi_sim.traineddata, but only 2538
> characters in the chi_sim_new.traineddata that I generated.
>
> Why do the two models contain different characters? Has the source data
> in chi_sim.training_text not been updated?
>


Re: [tesseract-ocr] Error:Assert failed:in file ../lstm/lstmtrainer.h, line 110

2017-08-21 Thread ShreeDevi Kumar
The lstm file is the trained LSTM model for the language. It is stored
inside the traineddata file.

DAWGs (directed acyclic word graphs) are compressed files created from
lists of words, punctuation or numbers.

You can use dawg2wordlist to unpack them.

Please follow the instructions on the training wiki page.



Re: [tesseract-ocr] Error:Assert failed:in file ../lstm/lstmtrainer.h, line 110

2017-08-21 Thread ShreeDevi Kumar
training/combine_tessdata -e tessdata/best/eng.traineddata \
  ~/tesstutorial/impact_from_full/eng.lstm


On 04-Aug-2017 12:03 PM,  wrote:

> Hello,
>
> I used 'git pull' to update the code from
> https://github.com/tesseract-ocr/tesseract.git, then recompiled and
> reinstalled Tesseract 4.0.
>
> But when I run the command shown below to fine-tune the traineddata,
> this error appears:
> "mgr_.Init(traineddata_path.c_str()):Error:Assert failed:in file
> ../lstm/lstmtrainer.h, line 110"
>
> lstmtraining --model_output ~/tesstutorial/chituned_from_chisim/chituned \
> --continue_from ~/tesstutorial/chituned_from_chisim/chi_sim.lstm \
> --train_listfile ~/tesstutorial/chitest/chi_sim.training_files.txt \
> --eval_listfile ~/tesstutorial/chitest/chi_sim.training_files.txt \
> --target_error_rate 0.01
>
>
>
> Everything worked before I updated the code, but now the command stops
> with this assertion error. Why? Can you help me?
>


Re: [tesseract-ocr] where can i find chinese original training data for re-train tesseract 4.0

2017-08-18 Thread ShreeDevi Kumar
The lead developer of tesseract-ocr is Ray Smith (at Google), @theraysmith
on GitHub.

He is in the process of updating the files for the upcoming 4.0.0 beta
release.

see
https://github.com/tesseract-ocr/langdata/issues/35#issuecomment-320330996

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Fri, Aug 18, 2017 at 2:14 PM, <514358...@qq.com> wrote:

> Who is Ray? How can I contact him?
>
> On Friday, August 18, 2017 at 4:33:16 PM UTC+8, shree wrote:
>>
>> langdata has NOT been updated for 4.0.
>>
>> Please wait for update from Ray.
>>
>> ShreeDevi
>> 
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Fri, Aug 18, 2017 at 12:42 PM, <5143...@qq.com> wrote:
>>
>>> Hi all,
>>>
>>> I want to retrain Tesseract 4.0 for Chinese, but I only found
>>> https://github.com/tesseract-ocr/langdata, which is just for Tesseract 3.0.
>>>
>>> I would appreciate any help.
>>>


Re: [tesseract-ocr] where can i find chinese original training data for re-train tesseract 4.0

2017-08-18 Thread ShreeDevi Kumar
langdata has NOT been updated for 4.0.

Please wait for update from Ray.

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Fri, Aug 18, 2017 at 12:42 PM, <514358...@qq.com> wrote:

> Hi all,
>
> I want to retrain Tesseract 4.0 for Chinese, but I only found
> https://github.com/tesseract-ocr/langdata, which is just for Tesseract 3.0.
>
> I would appreciate any help.
>


Re: [tesseract-ocr] Re: Improve the accuracy rate by training Tesseract4.0(LSTM), using fine tune

2017-08-18 Thread ShreeDevi Kumar
2017-08-18 12:48 GMT+05:30 <514358...@qq.com>:

> chi_sim.traineddata is not for LSTM4.0
>
>
That is not correct. See the chi_sim model in tessdata/best:

https://github.com/tesseract-ocr/tessdata/blob/master/best/chi_sim.traineddata



Re: [tesseract-ocr] Newbie: wondering why a fairly crisp document has such low accuracy

2017-08-12 Thread ShreeDevi Kumar
With English you should probably get close to 99% accuracy.

Is your png at 300 dpi?

Which version of tesseract did you use?
Which traineddata?

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sat, Aug 12, 2017 at 11:46 PM, Stephen Boesch  wrote:

> I printed out the "Welcome" page on my HP LaserJet printer and scanned it
> in as a .png. The quality is quite good, so I had been anticipating
> maybe 85%+ accuracy from tesseract-OCR. I did not even bother to tally
> carefully, but from eyeballing it the accuracy seems to be about 50%. I
> used all default settings.
>
> Some of the consistent errors:
>
> W -> H
> in -> m
> li -> h
> b -> t)
> ll -> H
>
> So is this just "the way things are" in OCR land?  Or am I missing some
> fundamental settings here - to get some reasonable usefulness?
>
> thanks
>
> stephenb
>


Re: [tesseract-ocr] Creation of encoded unicharset failed While constructing LSTM training data.

2017-08-10 Thread ShreeDevi Kumar
Seems to work fine for me.

Are you sure that the relevant files exist in the directories listed in
that command?

Check the tessdata and langdata locations.

Use tessdata/best/*.traineddata as the existing models.
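
A quick way to check the locations before running tesstrain.sh could look
like the sketch below. The list of files is an assumption based on paths
mentioned in related threads on this list (the training text, the
radical-stroke.txt table and the existing traineddata), not an official
list, and check_training_inputs is a made-up helper name.

```shell
# Report which of the files tesstrain.sh is expected to read are missing.
# Assumption: these three paths are the ones that matter; adjust as needed.
check_training_inputs() {
  langdata_dir=$1; tessdata_dir=$2; lang=$3
  missing=0
  for f in \
    "$langdata_dir/$lang/$lang.training_text" \
    "$langdata_dir/radical-stroke.txt" \
    "$tessdata_dir/$lang.traineddata"
  do
    if [ ! -f "$f" ]; then
      echo "missing: $f"
      missing=1
    fi
  done
  # non-zero exit status when at least one file is missing
  return "$missing"
}
```

For the command in question that would be
"check_training_inputs ../langdata ./tessdata eng".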

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Thu, Aug 10, 2017 at 2:05 PM,  wrote:

> Hello,
>
> I'm trying to fine-tune the eng.traineddata model following the steps at:
> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for-%C2%B1-a-few-characters
>
> As the tutorial shows, I am fine-tuning for ± a few characters.
>
> But when I execute the first command, to generate new training and eval
> data:
>
> training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
>   --noextract_font_properties --langdata_dir ../langdata \
>   --tessdata_dir ./tessdata --output_dir ~/tesstutorial/trainplusminus
>
>
> it fails with the error "Creation of encoded unicharset failed!" while
> constructing LSTM training data.
>
> For more details, refer to the image.
>
> Can you help me? Thanks.
>
>
>
> 
>


Re: [tesseract-ocr] 4.0-training

2017-08-08 Thread ShreeDevi Kumar
The training instructions for 4.0 have changed. Please see the wiki.

Which language are you trying to train?

Have you tried the current tessdata/best/*.traineddata model?

What's your feedback on those?

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Mon, Aug 7, 2017 at 12:45 PM, Richard Foo  wrote:

> Hi there,
>
> I'm trying to train tesseract 4.0 for a language by providing a large
> training text. After reading the documentation, I am not sure what to do.
> Is it just a matter of replacing the training text in langdata and
> installing the fonts we need before running tesstrain.sh?
>
> Thanks,
> Richard
>


Re: [tesseract-ocr] Re: Error:Assert failed:in file ../lstm/lstmtrainer.h, line 110

2017-08-07 Thread ShreeDevi Kumar
You also need to provide a traineddata file as input (via --traineddata).

Please review the updated training instructions in the wiki and change the
training commands accordingly.
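
For the fine-tuning command from the original report, the change could look
roughly like the sketch below. It is not a verified recipe: it assumes the
starting model is a chi_sim.traineddata from tessdata/best, reuses only
flags that already appear on this list, keeps the reporter's choice of
using the same list file for training and evaluation, and
finetune_chi_sim is a made-up helper name.

```shell
# Fine-tune from an existing model, passing the starting traineddata explicitly.
# All paths are placeholders; adjust them to your own layout.
finetune_chi_sim() {
  base=$1      # e.g. tessdata/best/chi_sim.traineddata
  workdir=$2   # e.g. ~/tesstutorial/chituned_from_chisim
  listfile=$3  # e.g. ~/tesstutorial/chitest/chi_sim.training_files.txt
  mkdir -p "$workdir" || return 1
  # Extract the LSTM component to continue training from.
  combine_tessdata -e "$base" "$workdir/chi_sim.lstm" || return 1
  lstmtraining \
    --model_output "$workdir/chituned" \
    --continue_from "$workdir/chi_sim.lstm" \
    --traineddata "$base" \
    --train_listfile "$listfile" \
    --eval_listfile "$listfile" \
    --target_error_rate 0.01
}
```

The only difference from the failing command is the added --traineddata
argument; everything else is carried over unchanged.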

On 07-Aug-2017 6:15 PM, "Ava Nimaee"  wrote:

> Hi, how did you solve it? I have this error too.
> Please help me.
>


Re: [tesseract-ocr] ERROR: Non-existent flag --traineddata

2017-08-07 Thread ShreeDevi Kumar
There have been changes since then.

Either update your git repository via

git pull origin

or clone it again.

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Mon, Aug 7, 2017 at 12:26 PM, Ava Nimaee <beigy.zoh...@gmail.com> wrote:

> About 3 weeks ago.
>
>
> On Sunday, August 6, 2017 at 7:59:44 AM UTC+4:30, shree wrote:
>>
>> >Invalid format in radical table at line 4: 3400 1.4
>>
>> When did you clone langdata?
>>
>> Ray updated radical-stroke.txt 11 days ago - see
>> https://github.com/tesseract-ocr/langdata/commit/3e32be3dc07be0994f3687664a44cb3246b5aa11

Re: [tesseract-ocr] ERROR: Non-existent flag --traineddata

2017-08-05 Thread ShreeDevi Kumar
>Invalid format in radical table at line 4: 3400 1.4

When did you clone langdata?

Ray updated radical-stroke.txt 11 days ago - see
https://github.com/tesseract-ocr/langdata/commit/3e32be3dc07be0994f3687664a44cb3246b5aa11

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sat, Aug 5, 2017 at 10:56 PM, Ava Nimaee <beigy.zoh...@gmail.com> wrote:

> Thanks for your attention.
> I removed everything, installed the latest versions of tesseract and
> leptonica again, and used this command:
> training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng \
>   --training_text training/langdata/eng/eng.training_text \
>   --linedata_only \
>   --noextract_font_properties --langdata_dir training/langdata \
>   --tessdata_dir ./tessdata \
>   --fontlist "Times New Roman," --output_dir ~/tesstutorial/engtrian
>
> But I got a new error. Everything looks fine until the end, where I get this:
>
> Setting unichar properties
> Other case É of é is not in unicharset
> Setting script properties
> Failed to read data from: training/langdata/eng/eng.config
> Null char=2
> Invalid format in radical table at line 4: 3400 1.4
> Creation of encoded unicharset failed!!
> Error writing recoder!!
> Reducing Trie to SquishedDawg
> Reducing Trie to SquishedDawg
> Reducing Trie to SquishedDawg
> Moving /tmp/tmp.GW5DOJr0rG/eng/eng.Times_New_Roman.exp0.lstmf to
> /home/zohreh/tesstutorial/engtrian
>
> Completed training for language 'eng'
>
> I don't have an eng.config in my langdata. I cloned langdata from the
> tesseract-ocr GitHub repositories.

Re: [tesseract-ocr] ERROR: Non-existent flag --traineddata

2017-08-05 Thread ShreeDevi Kumar
I have not tried with English.

Please create an eng.config file in the eng subdirectory of your langdata
directory and then try again.

You can put the following two lines in it:

# Use LSTM
tessedit_ocr_engine_mode 1
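
If it helps, the file can be created from the shell. This is a small
sketch that writes exactly the two lines above and refuses to overwrite an
existing file. The location <langdata>/eng/eng.config is taken from the
"Failed to read data from" message in the log, and write_eng_config is a
made-up helper name.

```shell
# Create <langdata>/eng/eng.config with the two suggested lines,
# unless the file already exists.
write_eng_config() {
  langdata_dir=$1   # e.g. training/langdata
  config="$langdata_dir/eng/eng.config"
  if [ -e "$config" ]; then
    echo "$config already exists; leaving it untouched"
    return 0
  fi
  mkdir -p "$langdata_dir/eng" || return 1
  printf '%s\n' '# Use LSTM' 'tessedit_ocr_engine_mode 1' > "$config"
}
```

With the directory layout from the failing command this would be
"write_eng_config training/langdata".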


ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sat, Aug 5, 2017 at 10:56 PM, Ava Nimaee <beigy.zoh...@gmail.com> wrote:

> Thanks for your attention.
> I removed everything, installed the latest versions of tesseract and
> leptonica again, and used this command:
> training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng \
>   --training_text training/langdata/eng/eng.training_text \
>   --linedata_only \
>   --noextract_font_properties --langdata_dir training/langdata \
>   --tessdata_dir ./tessdata \
>   --fontlist "Times New Roman," --output_dir ~/tesstutorial/engtrian
>
> But I got a new error. Everything looks fine until the end, where I get this:
>
> Setting unichar properties
> Other case É of é is not in unicharset
> Setting script properties
> Failed to read data from: training/langdata/eng/eng.config
> Null char=2
> Invalid format in radical table at line 4: 3400 1.4
> Creation of encoded unicharset failed!!
> Error writing recoder!!
> Reducing Trie to SquishedDawg
> Reducing Trie to SquishedDawg
> Reducing Trie to SquishedDawg
> Moving /tmp/tmp.GW5DOJr0rG/eng/eng.Times_New_Roman.exp0.lstmf to
> /home/zohreh/tesstutorial/engtrian
>
> Completed training for language 'eng'
>
> I don't have an eng.config in my langdata. I cloned langdata from the
> tesseract-ocr GitHub repositories.
Re: [tesseract-ocr] ERROR: Non-existent flag --traineddata

2017-08-05 Thread ShreeDevi Kumar
Are you using linux or windows?

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sat, Aug 5, 2017 at 6:55 PM, Ava Nimaee <beigy.zoh...@gmail.com> wrote:

> Thanks a lot, I will try again.
>
>
> On Saturday, August 5, 2017 at 5:50:59 PM UTC+4:30, shree wrote:
>
>> tesseract -v
>> tesseract 4.00.00dev-594-g044e06e-2085
>>  leptonica-1.74.4
>>   libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8
>>
>>  Found AVX
>>  Found SSE
>>
>>
>> The above version is working OK on Linux:
>>
>> nice lstmtraining \
>>   --old_traineddata ../tessdata/best/san.traineddata \
>>   --continue_from ../tessdata/best/san.lstm \
>>   --traineddata ../tesstutorial/vedic/san/san.traineddata \
>>   --train_listfile ../tesstutorial/vedic/san.training_files.txt \
>>   --eval_listfile ../tesstutorial/vedic/san.eval_files.txt \
>>   --model_output ../tesstutorial/vedic/santune \
>>   --max_iterations 200 \
>>   --debug_interval 0
>>
>> Loaded file ../tessdata/best/san.lstm, unpacking...
>> Warning: LSTMTrainer deserialized an LSTMRecognizer!
>> Code range changed from 145 to 2308!!
>> Num (Extended) outputs,weights in Series:
>>   1,36,0,1:1, 0
>> Num (Extended) outputs,weights in Series:
>>   C3,3:9, 0
>>   Ft16:16, 160
>> Total weights = 160
>>   [C3,3Ft16]:16, 160
>>   Mp3,3:16, 0
>>   Lfys48:48, 12480
>>   Lfx96:96, 55680
>>   Lrx96:96, 74112
>>   Lfx192:192, 221952
>>   Fc2308:2308, 445444
>> Total weights = 809828
>> Previous null char=2 mapped to 2
>> Continuing from ../tessdata/best/san.lstm
>> Loaded 138/138 pages (1-138) of document ../tesstutorial/vedic/san.AA_NAGARI_SHREE_L3.exp0.lstmf
>> Loaded 138/138 pages (1-138) of document ../tesstutorial/vedic/san.AA_NAGARI_SHREE_L3.exp-1.lstmf
>> Loaded 138/138 pages (1-138) of document ../tesstutorial/vedic/san.Adobe_Devanagari.exp-2.lstmf
>> Loaded 138/138 pages (1-138) of document ../tesstutorial/vedic/san.Adobe_Devanagari.exp1.lstmf
>>
>>
>> ShreeDevi
>> 
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Sat, Aug 5, 2017 at 6:43 PM, ShreeDevi Kumar <shree...@gmail.com>
>> wrote:
>>
>>> did you build the training tools again?
>>>
>>>
>>> ShreeDevi
>>> 
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>
>>> On Sat, Aug 5, 2017 at 6:37 PM, Ava Nimaee <beigy@gmail.com> wrote:
>>>
>>>> Yes. As you suggested, I cloned the latest tesseract master, installed it
>>>> and leptonica again, made the tiff, box, and unicharset files, and then
>>>> used these commands:
>>>> training/tesstrain.sh \
>>>>   --fonts_dir /usr/share/fonts \
>>>>   --lang eng  \
>>>>   --training_text langdata/eng/eng.training_text \
>>>>   --linedata_only \
>>>>   --noextract_font_properties  --langdata_dir langdata \
>>>>   --tessdata_dir ./tessdata \
>>>>   --fontlist "Times New Roman," \
>>>>   --output_dir tesstutorial/engtrian
>>>> 
>>>> training/tesstrain.sh \
>>>>   --fonts_dir /usr/share/fonts \
>>>>   --lang eng  \
>>>>   --training_text langdata/eng/eng.training_text \
>>>>   --linedata_only \
>>>>   --noextract_font_properties  --langdata_dir langdata \
>>>>   --tessdata_dir ./tessdata \
>>>>   --output_dir tesstutorial/engeval
>>>> Finally, I ran the last command, the one I said gave the error.
>>>> For that last command I put langdata/eng in the engtrian folder.
>>>>
>>>>
>>>> On Saturday, August 5, 2017 at 5:28:48 PM UTC+4:30, shree wrote:
>>>>>
>>>>> Are you using the latest source of programs from github for building
>>>>> tesseract?
>>>>>
>>>>> ShreeDevi
>>>>> 
>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>>
>>>>> On Sat, Aug 5, 2017 at 6:21 PM, Ava Nimaee <beigy@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi
>>>>>> i used this syntax:
>>>>>>

Re: [tesseract-ocr] ERROR: Non-existent flag --traineddata

2017-08-05 Thread ShreeDevi Kumar
tesseract -v
tesseract 4.00.00dev-594-g044e06e-2085
 leptonica-1.74.4
  libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8

 Found AVX
 Found SSE


The above version works fine on Linux:

nice lstmtraining \
  --old_traineddata ../tessdata/best/san.traineddata \
  --continue_from ../tessdata/best/san.lstm \
  --traineddata ../tesstutorial/vedic/san/san.traineddata \
  --train_listfile ../tesstutorial/vedic/san.training_files.txt \
  --eval_listfile ../tesstutorial/vedic/san.eval_files.txt \
  --model_output ../tesstutorial/vedic/santune \
  --max_iterations 200 \
  --debug_interval 0

Loaded file ../tessdata/best/san.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Code range changed from 145 to 2308!!
Num (Extended) outputs,weights in Series:
  1,36,0,1:1, 0
Num (Extended) outputs,weights in Series:
  C3,3:9, 0
  Ft16:16, 160
Total weights = 160
  [C3,3Ft16]:16, 160
  Mp3,3:16, 0
  Lfys48:48, 12480
  Lfx96:96, 55680
  Lrx96:96, 74112
  Lfx192:192, 221952
  Fc2308:2308, 445444
Total weights = 809828
Previous null char=2 mapped to 2
Continuing from ../tessdata/best/san.lstm
Loaded 138/138 pages (1-138) of document ../tesstutorial/vedic/san.AA_NAGARI_SHREE_L3.exp0.lstmf
Loaded 138/138 pages (1-138) of document ../tesstutorial/vedic/san.AA_NAGARI_SHREE_L3.exp-1.lstmf
Loaded 138/138 pages (1-138) of document ../tesstutorial/vedic/san.Adobe_Devanagari.exp-2.lstmf
Loaded 138/138 pages (1-138) of document ../tesstutorial/vedic/san.Adobe_Devanagari.exp1.lstmf
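
As a sanity check on the log above, the per-layer weight counts can be reproduced with simple arithmetic. This is only a sketch: the formulas (4 * outputs * (inputs + outputs + 1) for an LSTM layer, outputs * (inputs + 1) for a fully connected layer) are an assumption that happens to match the printed numbers, not something taken from the Tesseract source, and the helper names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers. The formulas are assumed; they were chosen because
# they reproduce the counts printed in the lstmtraining log.

# lstm_weights INPUTS OUTPUTS: four gates, each with input, recurrent and bias weights
lstm_weights() { echo $(( 4 * $2 * ($1 + $2 + 1) )); }

# fc_weights INPUTS OUTPUTS: one weight per input plus a bias, per output
fc_weights() { echo $(( $2 * ($1 + 1) )); }

ft16=$(fc_weights 9 16)          # Ft16 after the 3x3 window (9 inputs)
lfys48=$(lstm_weights 16 48)     # Lfys48
lfx96=$(lstm_weights 48 96)      # Lfx96
lrx96=$(lstm_weights 96 96)      # Lrx96
lfx192=$(lstm_weights 96 192)    # Lfx192
fc2308=$(fc_weights 192 2308)    # Fc2308, one output per code in the new range

total=$(( ft16 + lfys48 + lfx96 + lrx96 + lfx192 + fc2308 ))
echo "Total weights = $total"
```

Only the last layer depends on the code range (2308 here), which is consistent with the "Code range changed from 145 to 2308" line in the log.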


ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sat, Aug 5, 2017 at 6:43 PM, ShreeDevi Kumar <shreesh...@gmail.com>
wrote:

> did you build the training tools again?
>
>
> ShreeDevi
> 
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Sat, Aug 5, 2017 at 6:37 PM, Ava Nimaee <beigy.zoh...@gmail.com> wrote:
>
>> Yes, as you said, I cloned the latest tesseract master and installed it
>> and Leptonica again, made the TIFF and box files and the unicharset, and
>> then used this syntax:
>> training/tesstrain.sh \
>>   --fonts_dir /usr/share/fonts \
>>   --lang eng  \
>>   --training_text langdata/eng/eng.training_text \
>>   --linedata_only \
>>   --noextract_font_properties  --langdata_dir langdata \
>>   --tessdata_dir ./tessdata \
>>   --fontlist "Times New Roman," \
>>   --output_dir tesstutorial/engtrian
>> 
>> training/tesstrain.sh \
>>   --fonts_dir /usr/share/fonts \
>>   --lang eng  \
>>   --training_text langdata/eng/eng.training_text \
>>   --linedata_only \
>>   --noextract_font_properties  --langdata_dir langdata \
>>   --tessdata_dir ./tessdata \
>>   --output_dir tesstutorial/engeval
>> Finally, I ran the last command, the one I said gave the error.
>> For that last command I put langdata/eng in the engtrian folder.
>>
>>
>> On Saturday, August 5, 2017 at 5:28:48 PM UTC+4:30, shree wrote:
>>>
>>> Are you using the latest source of programs from github for building
>>> tesseract?
>>>
>>> ShreeDevi
>>> 
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>
>>> On Sat, Aug 5, 2017 at 6:21 PM, Ava Nimaee <beigy@gmail.com> wrote:
>>>
>>>> Hi,
>>>> I used this syntax:
>>>>
>>>> training/lstmtraining --debug_interval 100 \
>>>>   --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
>>>>   --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
>>>>   --model_output ~/tesstutorial/engoutput/base --learning_rate 20e-4 \
>>>>   --train_listfile ~/tesstutorial/engtrain/eng.training_files.txt \
>>>>   --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt \
>>>>   --max_iterations 5000 &>~/tesstutorial/engoutput/basetrain.log
>>>>
>>>> I put eng.traineddata in the right path, but I get an error:
>>>>
>>>> ERROR: Non-existent flag --traineddata
>>>>
>>>> Can you help me?
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to tesseract-oc...@googlegroups.com.
>>>> To post to this group, send email to tesser...@googlegroups.com.
>>>> Visit this group at https://groups.google

Re: [tesseract-ocr] ERROR: Non-existent flag --traineddata

2017-08-05 Thread ShreeDevi Kumar
Did you build the training tools again?
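
A rough way to tell whether the lstmtraining binary being run predates the --traineddata flag is to search its help output for the flag. The sketch below assumes that `lstmtraining --help` lists the supported flags, and the helper name is hypothetical. It checks whichever lstmtraining is on the PATH; the command in this thread ran training/lstmtraining from the source tree, so adjust the path if needed.

```shell
#!/bin/sh
# help_mentions_flag HELP_TEXT FLAG_NAME
# Hypothetical helper: succeeds if the help text contains "--FLAG_NAME".
help_mentions_flag() {
  printf '%s\n' "$1" | grep -qF -e "--$2"
}

if command -v lstmtraining >/dev/null 2>&1; then
  # Assumption: --help prints the list of known flags.
  if help_mentions_flag "$(lstmtraining --help 2>&1)" traineddata; then
    echo "$(command -v lstmtraining) knows --traineddata"
  else
    echo "$(command -v lstmtraining) looks stale; rebuild the training tools"
  fi
else
  echo "lstmtraining is not on PATH"
fi

# Rebuilding the training tools from an up-to-date source checkout
# typically looks like this (run in the tesseract source directory):
#   make training
#   sudo make training-install
```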


ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sat, Aug 5, 2017 at 6:37 PM, Ava Nimaee  wrote:

> Yes, as you said, I cloned the latest tesseract master and installed it
> and Leptonica again, made the TIFF and box files and the unicharset, and
> then used this syntax:
> training/tesstrain.sh \
>   --fonts_dir /usr/share/fonts \
>   --lang eng  \
>   --training_text langdata/eng/eng.training_text \
>   --linedata_only \
>   --noextract_font_properties  --langdata_dir langdata \
>   --tessdata_dir ./tessdata \
>   --fontlist "Times New Roman," \
>   --output_dir tesstutorial/engtrian
> 
> training/tesstrain.sh \
>   --fonts_dir /usr/share/fonts \
>   --lang eng  \
>   --training_text langdata/eng/eng.training_text \
>   --linedata_only \
>   --noextract_font_properties  --langdata_dir langdata \
>   --tessdata_dir ./tessdata \
>   --output_dir tesstutorial/engeval
> Finally, I ran the last command, the one I said gave the error.
> For that last command I put langdata/eng in the engtrian folder.
>
>
> On Saturday, August 5, 2017 at 5:28:48 PM UTC+4:30, shree wrote:
>>
>> Are you using the latest source of programs from github for building
>> tesseract?
>>
>> ShreeDevi
>> 
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Sat, Aug 5, 2017 at 6:21 PM, Ava Nimaee  wrote:
>>
>>> Hi,
>>> I used this syntax:
>>>
>>> training/lstmtraining --debug_interval 100 \
>>>   --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
>>>   --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
>>>   --model_output ~/tesstutorial/engoutput/base --learning_rate 20e-4 \
>>>   --train_listfile ~/tesstutorial/engtrain/eng.training_files.txt \
>>>   --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt \
>>>   --max_iterations 5000 &>~/tesstutorial/engoutput/basetrain.log
>>>
>>> I put eng.traineddata in the right path, but I get an error:
>>>
>>> ERROR: Non-existent flag --traineddata
>>>
>>> Can you help me?
>>>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduVRVO7RT2y9mDzXy6kQ0fXMDUeNp46m-%3DTw8qU%3Dj6eXGw%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Re: [tesseract-ocr] ERROR: Non-existent flag --traineddata

2017-08-05 Thread ShreeDevi Kumar
Are you building tesseract from the latest source on GitHub?

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sat, Aug 5, 2017 at 6:21 PM, Ava Nimaee  wrote:

> Hi,
> I used this syntax:
>
> training/lstmtraining --debug_interval 100 \
>   --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
>   --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
>   --model_output ~/tesstutorial/engoutput/base --learning_rate 20e-4 \
>   --train_listfile ~/tesstutorial/engtrain/eng.training_files.txt \
>   --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt \
>   --max_iterations 5000 &>~/tesstutorial/engoutput/basetrain.log
>
> I put eng.traineddata in the right path, but I get an error:
>
> ERROR: Non-existent flag --traineddata
>
> Can you help me?
>



Re: [tesseract-ocr] Failed to load list of training filenames from

2017-08-05 Thread ShreeDevi Kumar
Please see
https://github.com/tesseract-ocr/tessdata/issues/70#issuecomment-320441568

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sat, Aug 5, 2017 at 6:22 PM, Ava Nimaee  wrote:

> We tried it, but for some words and fonts the results are not very good,
> so we decided to train it.
>
> On Friday, August 4, 2017 at 7:30:04 PM UTC+4:30, shree wrote:
>>
>> Please try OCR with the new tessdata/best/fas.traineddata (Farsi/Persian)
>> and provide your feedback so Ray can improve the training.
>>
>> ShreeDevi
>> 
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Fri, Aug 4, 2017 at 6:40 PM, Ava Nimaee  wrote:
>>
>>> Thanks a lot.
>>> I'm sorry; I have just started training Tesseract 4.0 for Persian and I
>>> don't have any experience with it. I've tried a lot, but I hit a lot of errors.
>>> Many thanks for your assistance with our project.
>>>
>>> On Friday, August 4, 2017 at 4:12:34 PM UTC+4:30, shree wrote:
>>>>
>>>> Please check the Tesseract training wiki for new instructions.
>>>>
>>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
>>>>
>>>> Use the latest code from GitHub.
>>>>
>>>> ShreeDevi
>>>>
>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>
>>>> On Fri, Aug 4, 2017 at 5:03 PM, Ava Nimaee  wrote:
>>>>
>>>>> Hi, sorry, I have an error.
>>>>> Can you help me?
>>>>> I used this syntax:
>>>>> lstmtraining -U ../tesstutorial/englayer_from_eng/eng.unicharset \
>>>>>   --script_dir langdata --debug_interval 0 \
>>>>>   --continue_from   ../tesstutorial/englayer_from_eng/eng.lstm \
>>>>>   --append_index 5 --net_spec '[Lfx256 O1c105]' \
>>>>>   --model_output ../tesstutorial/englayer_from_eng/englayer \
>>>>>   --train_listfile ../tesstutorial/engtrain/eng.training_files.txt \
>>>>>   --eval_listfile ../tesstutorial/engeval/eng.training_files.txt \
>>>>>   --max_iterations 5
>>>>> but I get an error:
>>>>> Failed to load list of training filenames from
>>>>> ../tesstutorial/engtrain/eng.training_files.txt
>>>>>
>>>>>


Re: [tesseract-ocr] Failed to load list of training filenames from

2017-08-04 Thread ShreeDevi Kumar
Please try OCR with the new tessdata/best/fas.traineddata (Farsi/Persian)
and provide your feedback so Ray can improve the training.

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Fri, Aug 4, 2017 at 6:40 PM, Ava Nimaee  wrote:

> Thanks a lot.
> I'm sorry; I have just started training Tesseract 4.0 for Persian and I
> don't have any experience with it. I've tried a lot, but I hit a lot of errors.
> Many thanks for your assistance with our project.
>
> On Friday, August 4, 2017 at 4:12:34 PM UTC+4:30, shree wrote:
>>
>> Please check the Tesseract training wiki for new instructions.
>>
>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
>>
>> Use the latest code from GitHub.
>>
>> ShreeDevi
>> 
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Fri, Aug 4, 2017 at 5:03 PM, Ava Nimaee  wrote:
>>
>>> Hi, sorry, I have an error.
>>> Can you help me?
>>> I used this syntax:
>>> lstmtraining -U ../tesstutorial/englayer_from_eng/eng.unicharset \
>>>   --script_dir langdata --debug_interval 0 \
>>>   --continue_from   ../tesstutorial/englayer_from_eng/eng.lstm \
>>>   --append_index 5 --net_spec '[Lfx256 O1c105]' \
>>>   --model_output ../tesstutorial/englayer_from_eng/englayer \
>>>   --train_listfile ../tesstutorial/engtrain/eng.training_files.txt \
>>>   --eval_listfile ../tesstutorial/engeval/eng.training_files.txt \
>>>   --max_iterations 5
>>> but I get an error:
>>> Failed to load list of training filenames from
>>> ../tesstutorial/engtrain/eng.training_files.txt
>>>
>>>



Re: [tesseract-ocr] Failed to load list of training filenames from

2017-08-04 Thread ShreeDevi Kumar
Please check the Tesseract training wiki for new instructions.

https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00

Use the latest code from GitHub.
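
Independently of the version question, it can help to rule out a simple path problem before rerunning lstmtraining. The sketch below is a hypothetical helper, not part of Tesseract: it assumes the list file holds one .lstmf path per line and checks that the list file is readable and that every entry exists. Relative entries are resolved against the current directory.

```shell
#!/bin/sh
# check_listfile LISTFILE
# Hypothetical helper: returns 0 only if LISTFILE is readable and every
# non-empty line in it names an existing file.
check_listfile() {
  listfile=$1
  if [ ! -r "$listfile" ]; then
    echo "cannot read list file: $listfile" >&2
    return 1
  fi
  status=0
  # The "|| [ -n ... ]" part also handles a last line without a newline.
  while IFS= read -r entry || [ -n "$entry" ]; do
    [ -z "$entry" ] && continue
    if [ ! -f "$entry" ]; then
      echo "missing: $entry" >&2
      status=1
    fi
  done < "$listfile"
  return "$status"
}

# Example, using the path from the error message in this thread:
#   check_listfile ../tesstutorial/engtrain/eng.training_files.txt
```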

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Fri, Aug 4, 2017 at 5:03 PM, Ava Nimaee  wrote:

> Hi, sorry, I have an error.
> Can you help me?
> I used this syntax:
> lstmtraining -U ../tesstutorial/englayer_from_eng/eng.unicharset \
>   --script_dir langdata --debug_interval 0 \
>   --continue_from   ../tesstutorial/englayer_from_eng/eng.lstm \
>   --append_index 5 --net_spec '[Lfx256 O1c105]' \
>   --model_output ../tesstutorial/englayer_from_eng/englayer \
>   --train_listfile ../tesstutorial/engtrain/eng.training_files.txt \
>   --eval_listfile ../tesstutorial/engeval/eng.training_files.txt \
>   --max_iterations 5
> but I get an error:
> Failed to load list of training filenames from
> ../tesstutorial/engtrain/eng.training_files.txt
>
>



Re: [tesseract-ocr] "Can't encode transcript" error when using "lstmtraining" command with Tess4.0

2017-08-01 Thread ShreeDevi Kumar
Ray has uploaded new traineddata files to
https://github.com/tesseract-ocr/tessdata/tree/master/best

Why don't you try recognition with those first?
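
To see which characters trigger "Can't encode transcript", one rough approach is to compare the characters in the training text against the first column of the unicharset. The sketch below is an approximation under stated assumptions: the helper name is hypothetical, it assumes the unicharset file has a count on its first line and one entry per following line with the character as the first space-separated field, it needs a UTF-8 locale to split multi-byte characters correctly, and it ignores multi-character entries and any normalization the real encoder applies.

```shell
#!/bin/sh
# missing_chars UNICHARSET TRAINING_TEXT
# Hypothetical helper: prints each non-whitespace character that occurs in
# TRAINING_TEXT but is not the first field of any UNICHARSET entry.
missing_chars() {
  known=$(mktemp) || return 1
  # Skip the count line, keep the first field of each entry.
  tail -n +2 "$1" | cut -d' ' -f1 | sort -u > "$known"
  # One character per line, de-duplicated, minus the known ones.
  grep -o '[^[:space:]]' "$2" | sort -u | grep -Fxv -f "$known" || true
  rm -f "$known"
}

# Example (paths are placeholders; the unicharset can be unpacked from a
# traineddata file with "combine_tessdata -u"):
#   missing_chars chi_sim.lstm-unicharset my.training_text
```

Any character this prints is a candidate to report as missing from the existing traineddata.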

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, Aug 1, 2017 at 1:45 PM,  wrote:

> Hello, Shree:
>
> Sorry to ask, but can I use more than one unicharset, such as chi_sim and
> eng, to fine-tune the training?
> Some of the special characters may be in other unicharsets. If I find
> them, I could train my traineddata with more unicharsets, and the
> special characters would then be encoded.
>
> Thanks, and I hope for your reply.
>
> On Tuesday, July 25, 2017 at 3:23:08 PM UTC+8, shree wrote:
>>
>> That error occurs because some characters in your training text are not
>> part of the chi_sim unicharset.
>>
>> You are attempting fine-tuning, which will give this error. Replacing the
>> top layer will work.
>>
>> I suggest that you wait 2-3 weeks for Ray to upload new traineddata for
>> all languages.
>>
>> You can tell us if any specific characters are missing from the
>> existing traineddata.
>>
>> ShreeDevi
>> 
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Tue, Jul 25, 2017 at 12:46 PM,  wrote:
>>
>>> Hello,
>>>
>>> I used this command to train my own traineddata:
>>>
>>> lstmtraining --model_output ~/tesstutorial/chituned_from_chisim/chituned \
>>>   --continue_from ~/tesstutorial/chituned_from_chisim/chi_sim.lstm \
>>>   --train_listfile ~/tesstutorial/chitest/chi.training_files.txt \
>>>   --eval_listfile ~/tesstutorial/chitest/chi.training_files.txt \
>>>   --target_error_rate 0.01
>>>
>>> Tess4.0 then reports an error, shown in the following image. It says
>>> "Can't encode transcript" for text content such as
>>> "化简(-x2)3的结果是...".
>>> Why? Can you help me?
>>>
>>>
>>> 
>>>



Re: [tesseract-ocr] ERROR: Could not find training text file

2017-07-31 Thread ShreeDevi Kumar
Add a line like the following to your training command, pointing to where
you have your training text:

  --training_text ../langdata/eng/eng.training_text \
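
Before rerunning, it may help to check where the training text is expected. The sketch below assumes, based only on the path in the error message, that without --training_text the script looks for LANGDATA_DIR/LANG/LANG.training_text; the helper name and the my_text path in the comment are hypothetical.

```shell
#!/bin/sh
# default_training_text LANGDATA_DIR LANG
# Hypothetical helper: prints the path that the error message suggests is
# used when --training_text is not given.
default_training_text() {
  printf '%s/%s/%s.training_text\n' "$1" "$2" "$2"
}

expected=$(default_training_text langdata eng)
if [ -f "$expected" ]; then
  echo "found default training text: $expected"
else
  echo "no file at $expected; pass --training_text with the real location"
fi

# The command from this thread with the extra line added
# (my_text/eng.training_text is a placeholder for your own file):
#   training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
#     --noextract_font_properties --langdata_dir langdata \
#     --tessdata_dir tessdata \
#     --training_text my_text/eng.training_text \
#     --fontlist "Times New Roman," --output_dir engtrain
```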


ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Mon, Jul 31, 2017 at 4:24 PM, Ava Nimaee  wrote:

> Hi, sorry. I used this syntax:
> training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
>   --noextract_font_properties --langdata_dir langdata \
>   --tessdata_dir tessdata \
>   --fontlist "Times New Roman," --output_dir engtrain
> Before that I created the box file, the TIFF, and the unicharset output.
> I cloned langdata for Tesseract v4.0,
> but I get this error:
>  === Phase I: Generating training images ===
> ERROR: Could not find training text file langdata/eng/eng.training_text
> I can't solve it, and I don't know where I should put training_text.txt.
> It is the text file that I want to train on.
> Thanks for your attention.
>



Re: [tesseract-ocr] Building tesseract 4.0.0 from master on OS X

2017-07-30 Thread ShreeDevi Kumar
I do not have a Mac, so I cannot check. But you can try the
"with-training-tools" option ("Install OCR training tools") with the
Homebrew install, along with the --HEAD option.


If you still face a problem, please add a comment to the existing macOS
issue on GitHub.



Re: [tesseract-ocr] Building tesseract 4.0.0 from master on OS X

2017-07-30 Thread ShreeDevi Kumar
Also see
https://github.com/Homebrew/homebrew-core/blob/master/Formula/tesseract.rb

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Mon, Jul 31, 2017 at 9:32 AM, ShreeDevi Kumar <shreesh...@gmail.com>
wrote:

> Please see the following for the suggested solutions
>
> https://github.com/tesseract-ocr/tesseract/issues/864
> Can't Install Latest Head With Brew
>
> https://github.com/tesseract-ocr/tesseract/issues/830
> 3.05 can't be be built as Standalone Self-contained Tesseract-OCR for Mac
>
> Regarding TESSDATA_PREFIX, you can try either of the following:
> export the location, or
> give --tessdata-dir as part of the command.
>
> e.g.
>
> export TESSDATA_PREFIX=/home/shree/tesseract-ocr
>
> tesseract --tessdata-dir /home/shree/tesseract-ocr testing/phototest.jpg testing/phototest-jpg
>
> Your /path/to/repos/tesseract/ should point to where you actually have
> your tessdata files.
>
>
>
> ShreeDevi
> 
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Sun, Jul 30, 2017 at 11:02 PM, Kevin Schiesser <
> kevin.joseph.schies...@gmail.com> wrote:
>
>> Hi all,
>>
>> I hit 2 walls when trying to build and run tesseract from the latest
>> checkout of master on OS X Sierra. First, I ran into some issues when
>> running make training. After some Makefile hacking I was able to link
>> libpango-1.0, but failed on libgobject-2.0. I couldn't find much about the
>> availability of this library for Macs and stopped there.
>>
>> The second issue is when running the tesseract vanilla OCR binary built
>> from source:
>>
>> TESSDATA_PREFIX=/path/to/repos/tesseract/tessdata tesseract AmazonSonicare.pdf ./
>> Error opening data file /path/to/repos/tesseract/eng.traineddata
>> Please make sure the TESSDATA_PREFIX environment variable is set to your
>> "tessdata" directory.
>> Failed loading language 'eng'
>> Tesseract couldn't load any languages!
>> Could not initialize tesseract.
>>
>> Along the way I had to depart from the Mac Homebrew instructions on the
>> git wiki and pass in gcc/g++ v7.0 to the configure step (the instructions
>> say to use v6.0). That said, the main binary build didn't report any
>> warnings or errors.
>>
>> Does the project intend to support Mac or should I simply use a Linux
>> environment going forward?
>>
>> Thanks much,
>> Kevin
>>
>
>



Re: [tesseract-ocr] Building tesseract 4.0.0 from master on OS X

2017-07-30 Thread ShreeDevi Kumar
Please see the following for the suggested solutions

https://github.com/tesseract-ocr/tesseract/issues/864
Can't Install Latest Head With Brew

https://github.com/tesseract-ocr/tesseract/issues/830
3.05 can't be be built as Standalone Self-contained Tesseract-OCR for Mac

Regarding TESSDATA_PREFIX:

You can try either of the following: export the location as an environment
variable, or pass --tessdata-dir as part of the command.

e.g.

export TESSDATA_PREFIX=/home/shree/tesseract-ocr

tesseract --tessdata-dir /home/shree/tesseract-ocr testing/phototest.jpg testing/phototest-jpg

The /path/to/repos/tesseract/ in your command should reflect where you
actually have your tessdata files.
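
Before running tesseract, it can help to confirm which candidate directory actually contains eng.traineddata. The following is only a sketch in Python and is not part of tesseract. The function name find_traineddata and the example paths are made up for illustration, and it assumes the language files are named <lang>.traineddata.

```python
from pathlib import Path


def find_traineddata(candidates, lang="eng"):
    """Return the candidate directories that contain <lang>.traineddata.

    `candidates` is any iterable of directory paths (the ones used below
    are hypothetical); nothing here calls tesseract itself.
    """
    wanted = f"{lang}.traineddata"
    return [str(Path(d)) for d in candidates if (Path(d) / wanted).is_file()]


if __name__ == "__main__":
    # Example directories only; replace them with your own checkout.
    for hit in find_traineddata(["/path/to/repos/tesseract",
                                 "/path/to/repos/tesseract/tessdata"]):
        print("traineddata found in:", hit)
```

Whichever directory is reported is the one to try with the export or the --tessdata-dir option.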



ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sun, Jul 30, 2017 at 11:02 PM, Kevin Schiesser <
kevin.joseph.schies...@gmail.com> wrote:

> Hi all,
>
> I hit 2 walls when trying to build and run tesseract from the latest
> checkout of master on OS X Sierra. First, I ran into some issues when
> running make training. After some Makefile hacking I was able to link
> libpango-1.0, but failed on libgobject-2.0. I couldn't find much about the
> availability of this library for Macs and stopped there.
>
> The second issue is when running the tesseract vanilla OCR binary built
> from source:
>
> TESSDATA_PREFIX=/path/to/repos/tesseract/tessdata tesseract
> AmazonSonicare.pdf ./
> Error opening data file /path/to/repos/tesseract/eng.traineddata
> Please make sure the TESSDATA_PREFIX environment variable is set to your
> "tessdata" directory.
> Failed loading language 'eng'
> Tesseract couldn't load any languages!
> Could not initialize tesseract.
>
> Along the way I had to depart from the Mac Homebrew instructions on the
> git wiki and pass in gcc/g++ v7.0 to the configure step (the instructions
> say to use v6.0). That said, the main binary build didn't report any
> warnings or errors.
>
> Does the project intend to support Mac or should I simply use a Linux
> environment going forward?
>
> Thanks much,
> Kevin


Re: [tesseract-ocr] Re: Combining tessdata files Error opening unicharset file

2017-07-28 Thread ShreeDevi Kumar
You need to mv (rename) the files so that they carry the por. prefix.

Then, when you run the combine_tessdata command, it will use all the por.
files to create the traineddata.

see
https://github.com/tesseract-ocr/tesseract/blob/master/training/tesstrain_utils.sh

mv ${TRAINING_DIR}/inttemp ${TRAINING_DIR}/${LANG_CODE}.inttemp
mv ${TRAINING_DIR}/shapetable ${TRAINING_DIR}/${LANG_CODE}.shapetable
mv ${TRAINING_DIR}/pffmtable ${TRAINING_DIR}/${LANG_CODE}.pffmtable
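
If you are not going through tesstrain_utils.sh, the same renaming can be sketched in Python. This is a rough sketch only: the function name add_lang_prefix is made up, and the default component list is just the three files from the mv lines above, so add whatever other components your build needs.

```python
from pathlib import Path

# Only the three components named in the mv commands above; extend as needed.
DEFAULT_COMPONENTS = ("inttemp", "shapetable", "pffmtable")


def add_lang_prefix(training_dir, lang_code, components=DEFAULT_COMPONENTS):
    """Rename <dir>/<component> to <dir>/<lang_code>.<component>.

    Returns the new file names that were created. Components that are
    missing are skipped rather than treated as an error.
    """
    renamed = []
    base = Path(training_dir)
    for name in components:
        src = base / name
        if src.is_file():
            dst = base / f"{lang_code}.{name}"
            src.rename(dst)
            renamed.append(dst.name)
    return renamed
```

For example, add_lang_prefix("/path/to/training_dir", "por") would leave por.inttemp and so on in that directory (the path is a placeholder), ready for combine_tessdata to be run on the por. prefix.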

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Fri, Jul 28, 2017 at 4:23 PM,  wrote:

> This my essay


Re: [tesseract-ocr] Re: Combining tessdata files Error opening unicharset file

2017-07-27 Thread ShreeDevi Kumar
What command did you use?

Make sure that all the components are there as listed.

It looks like only the unicharset was available for building your traineddata.
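
To make that check concrete, here is a small sketch in Python. It reads a component listing in the format quoted below (lines such as "1:unicharset:size=1124, offset=192") and applies the rule from the error message: a unicharset together with inttemp, or an lstm component. The function names are invented for this example, and the line format is assumed from the quoted output only.

```python
import re

# Matches listing lines like "1:unicharset:size=1124, offset=192".
_ENTRY = re.compile(r"^\s*(\d+):([A-Za-z0-9_\-]+):size=(\d+), offset=(\d+)\s*$")


def listed_components(listing_text):
    """Return the set of component names found in a directory listing."""
    names = set()
    for line in listing_text.splitlines():
        match = _ENTRY.match(line)
        if match:
            names.add(match.group(2))
    return names


def has_minimum_components(names):
    """Apply the rule quoted in the error: (unicharset and inttemp) or lstm."""
    return ({"unicharset", "inttemp"} <= names) or ("lstm" in names)
```

Fed the two listing lines from the quoted output, this finds only unicharset and version, so the check fails in the same way the tool reports.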

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Thu, Jul 27, 2017 at 8:46 PM,  wrote:

> hello,
>
>
> I have already tried this step but finally I got this error:
>
>
> Combining tessdata files
> Error: traineddata file must contain at least (a unicharset fileand
> inttemp) OR an lstm file.
> Error combining tessdata files into por.traineddata
> Version string:4.00.00alpha
> 1:unicharset:size=1124, offset=192
> 23:version:size=12, offset=1316
>
>
>
> please can you help me !
>
> ahmed barbouche


Re: [tesseract-ocr] Page segmentation and preserve_interword_space are not working

2017-07-26 Thread ShreeDevi Kumar
Try 'tsv' instead of 'hocr' as the output format.
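
The reason tsv can help is that each word comes with its bounding box, so rows can be rebuilt from the coordinates regardless of the order in which the blocks are emitted. A minimal sketch in Python, assuming the usual TSV columns (level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf, text) with word entries at level 5. The function name rows_from_tsv and the tolerance value are made up here and would need tuning for a real scan.

```python
import csv
import io


def rows_from_tsv(tsv_text, tolerance=10):
    """Group word entries into table rows by their vertical position.

    Words whose `top` value lies within `tolerance` pixels of the first
    word of a row are placed in that row; each row is ordered by `left`.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    words = []
    for rec in reader:
        if rec["level"] != "5":          # keep word-level entries only
            continue
        text = (rec["text"] or "").strip()
        if text:
            words.append((int(rec["top"]), int(rec["left"]), text))

    words.sort()                          # top to bottom, then left to right
    rows = []                             # each item: [row_top, [(left, text)]]
    for top, left, text in words:
        if rows and abs(top - rows[-1][0]) <= tolerance:
            rows[-1][1].append((left, text))
        else:
            rows.append([top, [(left, text)]])
    return [[text for _, text in sorted(cells)] for _, cells in rows]
```

For the example table in the question, the intent is that X and 1 end up in the same row because their top values are close, even though they come from different blocks.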

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Wed, Jul 26, 2017 at 10:30 PM, Prav  wrote:

> Hi,
>
> I am trying to extract tabular data. For this I am converting the image
> into hocr.
> Now this hocr is not coming properly. It first puts the data for one
> column and then for the other. I do not get data which is put row wise and
> column wise so that the extraction comes as a proper table.
>
> I have tried with -psm 5 and with -psm 6 but in both cases the hocr looks
> identical.
>
> I am using tesseract 3.05
>
> even preserve_interword_space set to 1 is not working.
>
> Any help would be useful
>
> For eg
> we have the following in the image
>
> Column 1 Column 2
> X   1
> Y   2
> Z   3
>
> hocr is giving
>
> X
> Y
> Z
> 1
> 2
> 3
>
> I would like the output to be
>
> X 1
> Y 2
> Z 3
>
> Will be grateful for any help and/or ideas
>
> Thanks


Re: [tesseract-ocr] Error:Assert failed:in file text2image.cpp, line 428

2017-07-26 Thread ShreeDevi Kumar
Which version of tesseract are you using? Which platform?

Try building the latest code from github and use that.
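
One thing that stands out in the quoted command, independent of the version: the outputbase flag is written with a single en dash (–outputbase) rather than two hyphens, and its value contains unquoted spaces. If that is what was actually typed, and not something the mail client changed, text2image would not see --outputbase at all, which would match the "Output file missing!" message. The Python sketch below just flags such arguments before a command is run. The function name and the list of dash characters are my own choices for illustration.

```python
# Dash-like characters that word processors and mail clients often
# substitute for "--" (en dash, em dash, horizontal bar, minus sign).
LOOKALIKE_DASHES = ("\u2013", "\u2014", "\u2015", "\u2212")


def suspicious_flags(argv):
    """Return the arguments that start with a look-alike dash, not '-'."""
    return [arg for arg in argv if arg.startswith(LOOKALIKE_DASHES)]
```

An output name without spaces, for example --outputbase=eng.TimesNewRoman.exp0, would also avoid the shell splitting the value into several arguments.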

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, Jul 25, 2017 at 9:02 PM, Ava Nimaee  wrote:

> hi
> sorry but i can't solve this error. when i used  "text2image
> --text=training_text.txt –outputbase=eng.Times New Roman,.exp0
> --font='Times New Roman,' --fonts_dir=/usr/share/fonts"
> show me this :
> Output file missing!
> !FLAGS_outputbase.empty():Error:Assert failed:in file text2image.cpp,
> line 428
> Segmentation fault (core dumped)
> can you please help me?


Re: [tesseract-ocr] Could not find font named AR PL UMing Patched Light

2017-07-26 Thread ShreeDevi Kumar
I do not have this font.

The training is done at Google. They probably use a number of commercial
fonts in addition to freely available fonts. The fonts are not provided as
part of the training data.

You have to get your own set of fonts to train or wait for the new
traineddata by Ray (expected in next few weeks).

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Wed, Jul 26, 2017 at 11:09 AM,  wrote:

> Yeah, I know that. But I lack the AR PL UMing Patched Light font, which
> cannot be found on the Internet.
>
> I'm afraid that I may need to get this package (the AR PL UMing
> Patched Light font) from you. If you don't mind sharing your resources, thank
> you sincerely.
>
> On Wednesday, July 26, 2017 at 11:31:23 AM UTC+8, shree wrote:
>>
>> The training process uses the list of fonts from
>> https://github.com/tesseract-ocr/tesseract/blob/master/training/language-specific.sh
>>
>> You need to update it to match the fonts available to you for the
>> script you are training, and include the correct location of the fonts
>> directory.
>>
>> ShreeDevi
>> 
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Wed, Jul 26, 2017 at 7:17 AM,  wrote:
>>
>>> Hello,
>>>
>>> I'm trying to train my own traineddata with Tess4.0 following the tutorial:
>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00---Replace-Top-Layer
>>>
>>> When executing the command:
>>> training/tesstrain.sh --fonts_dir /usr/share/fonts --lang chi_sim \
>>> --training_text ../training_data/part.txt \
>>> --linedata_only --noextract_font_properties \
>>> --langdata_dir ../langdata --tessdata_dir ./tessdata \
>>> --output_dir ~/tesstutorial/chisim
>>>
>>> An error appears: "Could not find font named AR PL UMing Patched Light",
>>> as shown in the following image.
>>>
>>> I then searched for the "AR PL UMing Patched Light.ttf" package with
>>> Baidu, Google and some other search engines, but could not find it.
>>>
>>> Can you help me? I don't know if there are other solutions for this
>>> problem.


Re: [tesseract-ocr] Could not find font named AR PL UMing Patched Light

2017-07-25 Thread ShreeDevi Kumar
The training process uses the list of fonts from
https://github.com/tesseract-ocr/tesseract/blob/master/training/language-specific.sh

You need to update it to match the fonts available to you for the script
you are training, and include the correct location of the fonts directory.
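
To see which of the font names in language-specific.sh are actually installed, one option is to compare them with the output of text2image --list_available_fonts --fonts_dir <dir>. The Python sketch below assumes that output has one numbered entry per line (for example "  0: Arial"). That format, the function names and the sample font names are assumptions for illustration, so check them against your own build.

```python
import re

# Assumed line format of the font listing: "<index>: <font name>".
_FONT_LINE = re.compile(r"^\s*\d+:\s*(.+?)\s*$")


def parse_font_listing(listing_text):
    """Extract the font names from a numbered font listing."""
    fonts = set()
    for line in listing_text.splitlines():
        match = _FONT_LINE.match(line)
        if match:
            fonts.add(match.group(1))
    return fonts


def missing_fonts(wanted, listing_text):
    """Return the wanted font names that do not appear in the listing."""
    available = parse_font_listing(listing_text)
    return [name for name in wanted if name not in available]
```

Fonts reported as missing can then be dropped from the list for your script, or replaced with fonts you do have.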

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Wed, Jul 26, 2017 at 7:17 AM,  wrote:

> Hello,
>
> I'm trying to train my own traineddata with Tess4.0 following the tutorial:
> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00---Replace-Top-Layer
>
> When executing the command:
> training/tesstrain.sh --fonts_dir /usr/share/fonts --lang chi_sim \
> --training_text ../training_data/part.txt \
> --linedata_only --noextract_font_properties \
> --langdata_dir ../langdata --tessdata_dir ./tessdata \
> --output_dir ~/tesstutorial/chisim
>
> An error appears: "Could not find font named AR PL UMing Patched Light",
> as shown in the following image.
>
> I then searched for the "AR PL UMing Patched Light.ttf" package with
> Baidu, Google and some other search engines, but could not find it.
>
> Can you help me? I don't know if there are other solutions for this
> problem.


Re: [tesseract-ocr] "Can't encode transcript" error when using "lstmtraining" command with Tess4.0

2017-07-25 Thread ShreeDevi Kumar
That error is because some characters in your training text are not part of
the unicharset of chi_sim.

You are trying fine-tuning, which will give this error. Replace-top-layer
training will work.

I suggest that you wait 2-3 weeks for Ray to upload new traineddata for all
languages.

You can tell us if there are any specific characters missing from the existing
traineddata.
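
To find the specific characters, one approach is to compare the training text against the unicharset. A rough sketch in Python, assuming you have the unicharset as a plain text file whose first line is the entry count and whose following lines each start with the unichar followed by a space, with the space character itself written as NULL. The function names are invented here, and the per-character comparison is only an approximation, since a unicharset entry can span more than one code point.

```python
def load_unichars(unicharset_text):
    """Collect the first field of every entry line of a unicharset file."""
    unichars = set()
    for line in unicharset_text.splitlines()[1:]:   # skip the count line
        if not line.strip():
            continue
        token = line.split(" ", 1)[0]
        unichars.add(" " if token == "NULL" else token)
    return unichars


def missing_characters(training_text, unichars):
    """Return the sorted non-space characters that have no unicharset entry."""
    return sorted({ch for ch in training_text
                   if not ch.isspace() and ch not in unichars})
```

Running missing_characters over the training text gives a concrete list of characters to report.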

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, Jul 25, 2017 at 12:46 PM,  wrote:

> Hello,
>
> I apply the command to train my own traineddata:
>
> lstmtraining --model_output ~/tesstutorial/chituned_from_chisim/chituned \
>   --continue_from ~/tesstutorial/chituned_from_chisim/chi_sim.lstm \
>   --train_listfile ~/tesstutorial/chitest/chi.training_files.txt \
>   --eval_listfile ~/tesstutorial/chitest/chi.training_files.txt \
>   --target_error_rate 0.01
>
> Tess4.0 reports an error, shown in the following image. It says
> "Can't encode transcript" for text content such as
> "化简(-x2)3的结果是...".
> Why? Can you help me?


Re: [tesseract-ocr] Combine_tessdata command error while training Tesseract4.0

2017-07-24 Thread ShreeDevi Kumar
Is your traineddata file present at ../tessdata/nor.traineddata?
Is it the 4.00 version?

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Mon, Jul 24, 2017 at 1:47 PM,  wrote:

>  Hello,
>
> I'm trying to train Tesseract 4.0 following the steps in the tutorial:
> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00---Replacing-Top-Layer-Example
>
> But when I execute the command:
>
> mkdir -p ~/tesstutorial/nor_layer
> combine_tessdata -e ../tessdata/nor.traineddata \
>   ~/tesstutorial/nor_layer/nor.lstm
>
>
> The system gives the following error message: Not
> extracting /home/robert/tesstutorial/nor_layer/nor.lstm, since this
> component is not present.
>
> Why do I receive this error? The tutorial shows the message "Wrote
> /home/shree/tesstutorial/nor_layer/nor.lstm", which indicates that nor.lstm
> will be written.
> Why does the system say that the nor.lstm component is not present? Can you
> help me? (Thanks)


Re: [tesseract-ocr] Using TessPDFRenderer in tesseract 3.05 in C++

2017-07-21 Thread ShreeDevi Kumar
Take a look at tesseractmain.cpp:

  api->GetBoolVariable("tessedit_create_pdf", &b);
  if (b) {
    bool textonly;
    api->GetBoolVariable("textonly_pdf", &textonly);
    renderers->push_back(new tesseract::TessPDFRenderer(
        outputbase, api->GetDatapath(), textonly));
  }



Re: [tesseract-ocr] Train tess4 LSTM with own images

2017-07-21 Thread ShreeDevi Kumar
Currently, LSTM training is only supported for box/tiff pairs generated by
text2image via the tesstrain.sh script.

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Fri, Jul 21, 2017 at 12:55 PM, Sophea PRUM  wrote:

> Hello,
>
> I'm currently working with tesseract 4. I would like to train the tesseract 4
> LSTM model using our existing images.
>
> I am following this link:
> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
>
> Unfortunately, it only provides a tutorial for training tesseract with text
> and existing fonts. I did not see any explanation about using our own images.
>
> Appreciate your help
> Thanks
> Sophea


Re: [tesseract-ocr] Using TessPDFRenderer in tesseract 3.05 in C++

2017-07-21 Thread ShreeDevi Kumar
Are you able to create PDFs using the command line?

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Fri, Jul 21, 2017 at 12:09 PM, Roger Jefferson <
roger.t.jeffer...@gmail.com> wrote:

> I want to use tesseract 3.05 to generate searchable PDF programmatically
> in C++. Here is my code:
>
> int main(int argc, const char * argv[]){
> tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
> // Initialize tesseract-ocr with English, without specifying tessdata path
> if (api->Init(NULL, "eng")) {
> fprintf(stderr, "Could not initialize tesseract.\n");
> exit(1);
> }
> Pix *image = pixRead("/Users/user1/pictures/page1.png");
>
> tesseract::TessResultRenderer* renderer = new 
> tesseract::TessPDFRenderer("/Users/user1/Documents/", 
> "/usr/local/share/tessdata");
> api->ProcessPage(image, 0, "/Users/user1/Documents/page1_pdf", NULL, 0, 
> renderer);
> api->End();
>
> pixDestroy();
> delete renderer;
>
> return 0;}
>
>
> The problem is that every time I get to api->ProcessPage() I get an
> assertion error:
>
> size_used_ > 0:Error:Assert failed:in file ../ccutil/genericvector.h, line
> 696
>
> Can anyone help? What's wrong? Is there a better way to generate PDF
> output?
>
> Thanks in advance


Re: [tesseract-ocr] Re: train a new font for language of persian

2017-07-18 Thread ShreeDevi Kumar
I would suggest that you wait a few weeks more for Ray to upload the new
traineddata files for tesseract4.0.0beta and then try it.

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Wed, Jul 19, 2017 at 10:30 AM, Ava Nimaee  wrote:

> Sorry about my delay. I need to train some words such as لا;
> in the previous version, words like this were detected wrongly. Now I want to
> understand: in version 4.0, do we need font detection, or can we train any
> fonts together?
> And is there a batch file for tesseract 4.0? Can I have it?
> Thanks a lot
>
>
> On Friday, May 5, 2017 at 7:01:03 PM UTC+4:30, shree wrote:
>>
>> There is already farsi/persian traineddata for tesseract-ocr 4.0-alpha at
>> https://github.com/tesseract-ocr/tessdata/raw/master/fas.traineddata
>>
>> Have you given it a try? Which font do you want to add to it?
>>
>> On Thursday, May 4, 2017 at 6:06:09 PM UTC+5:30, Ava Nimaee wrote:
>>>
>>> Hi everyone. I want to start using tesseract for the first time. Where
>>> should I start? I want to train a new font for the Persian language, but I
>>> am confused.


Re: [tesseract-ocr] tesseract 4 skips over some text

2017-07-18 Thread ShreeDevi Kumar
Please see
https://github.com/tesseract-ocr/tesseract/issues/681#issuecomment-303027906

You can try changing those constants to see if you get any improvement.

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, Jul 18, 2017 at 11:32 PM, Chris Hawley  wrote:

> The file that i am running OCR on
>
> https://drive.google.com/file/d/0B-iKKP8eIvdgZkhObUVXUVJ1N28/view?usp=sharing
>
> Before anyone asks, it's part of the CIA's Crest Dataset. I noticed
> tesseract seems to skip over some text. The command that I am using is
>
> E:\Tesseract\build\bin\Release\tesseract.exe --psm 1 --oem 1
>  "D:\split\Folder 001\1946-06-21.tiff" test.txt
>
> The output is
>
> 21 June 1946
>
> MEMORANDUM For SUPERVISING AGENT,
> U. S. SECRET SERVICE,
> WHITE Hous®.
>
>
>
> 1. - It is requested that a White House pass be issued to
> Lieutenant General Hoyt S. VANDENBERG, Director of Central Intel-
>
> ligence.
>
>
>
> 2. - In connection with his official duties, it is necessary
> for General Vandenberg to visit the White House frequently,.
>
>
>
>
>
>
>
> 3% His physical description is:
>
> Height =-- 6 feet.
> Hair «-- _ @FAY ,
> Eyes -- _- blue.
>
> Enclosed herewith is his photograph.
>
> THOMAS F, CULLEN
> Captain, USNR
> Asgistant to the Director.
>
>
>
> If you notice, it skips over the "weight -- 165 lbs" line. I wasn't sure
> if this qualified as a bug. Is there anything that I can do to improve the
> results so that line is included?


[tesseract-ocr] Fwd: [tesseract-ocr/tesseract] Tag a new version for LSTM 4.0 (#995)

2017-07-11 Thread ShreeDevi Kumar
Forwarding update by Ray.


-- Forwarded message --
From: theraysmith 
Date: Wed, Jul 12, 2017 at 5:55 AM
Subject: Re: [tesseract-ocr/tesseract] Tag a new version for LSTM 4.0 (#995)
To: tesseract-ocr/tesseract 


I'm about ready to update the traineddatas. I have a training run almost
complete, and with accuracy that meets with my satisfaction.
There are a few regressions, but not too serious.
First though, I have to get some code reviewed in Google, and then make
some commits to github to match the new traineddatas.
Before that, there is the matter of a major pull...

Here's what's coming:

- Fix to issue 653: New components in traineddata file for the
unicharset, recoder and version string. Backwards compatible change, so the
LSTM component can still read older files.
- Change in training system. The above change makes open source training
impossible. Will add a new program to build a starter traineddata from a
unicharset and optional word lists.
- New "normalization" code to clean corpus text in all languages. That
was a big part of the work.
- Improvements to the trained networks to improve accuracy on single
characters and single words.
- 2 parallel sets of tessdata. "best" and "fast". "Fast" will exceed the
speed of legacy Tesseract in real time, provided you have the required
parallelism components, and in total CPU only slightly slower for English.
Way faster for most non-Latin languages, while being <5% worse than "best".
Only "best" will be retrainable, as "fast" will be integer.

I have other stuff that is still incomplete, but that is a good list for
now.

BTW, in case you hadn't noticed, there was a breaking change that made old
lstmf files unusable. That was needed to fix LSTM for OSD. It has to know
the language of each training sample.
The new traineddatas will mostly be smaller than the older ones, as they
won't contain the legacy components, and no bigram dawgs are needed.


-- 
Ray.



Re: [tesseract-ocr] While extracting numbers tesseract makes a lot of errors

2017-07-09 Thread ShreeDevi Kumar
If using 3.05 branch

try configs such as

digits
whitelist
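
A rough sketch of what those two options look like on the command line, added here for illustration and not part of the original reply. It assumes the `digits` config file that ships in tessdata/configs and the `tessedit_char_whitelist` variable, which is what "whitelist" usually refers to on the 3.05 branch. The helper name `digits_command` and the file names are made up for the example.

```python
def digits_command(image, outputbase, use_config=True, whitelist="0123456789"):
    """Build a tesseract argv list that restricts recognition to digits.

    Hypothetical helper for illustration; it only builds the argument list
    and does not run tesseract.
    """
    if use_config:
        # Config file names go last, after the output base.
        return ["tesseract", image, outputbase, "digits"]
    # Alternative: set the whitelist variable directly with -c VAR=VALUE.
    return ["tesseract", image, outputbase,
            "-c", "tessedit_char_whitelist=" + whitelist]


print(" ".join(digits_command("scan.png", "out")))
print(" ".join(digits_command("scan.png", "out", use_config=False)))
```

Either list can be passed to `subprocess.run` as-is. Whether the whitelist removes confusions such as O for 0 on a given scan still has to be checked on real data.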

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sun, Jul 9, 2017 at 7:36 PM, Prav  wrote:

> Any suggestions for a configuration I can use to extract numbers correctly
> from scanned documents? Tesseract makes errors such as O for 0 and $ for 4.
>


Re: [tesseract-ocr] Tesseract-ocr on Redhat 5

2017-07-07 Thread ShreeDevi Kumar
For 3.05, don't you need to check out the 3.05 branch?
master is for 4.0 development.

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Fri, Jul 7, 2017 at 9:22 PM, akhil katpally 
wrote:

> Steven, here is the list of commands to install tesseract 3.05 on Redhat
> 6. Hopefully this works for Redhat 5 too; if not, please try downgrading
> tesseract and trying again.
>
>   sudo yum update
>   sudo yum install wget unzip
>   sudo yum install gcc gcc-c++ make
>   sudo yum install libjpeg-devel libpng-devel libtiff-devel zlib-devel
>   sudo yum install libtool
>   sudo yum install autoconf automake
>
>   sudo yum whatprovides libtool
>   (Install the latest version)
>   sudo yum whatprovides libtiff
>   sudo yum install libtiff-4.0.3-27.el7_3.x86_64
>
>   Install autoconf-archive from:
>   http://rpm.pbone.net/index.php3/stat/4/idpl/23652016/dir/centos_6/com/autoconf-archive-2012.04.07-7.3.noarch.rpm.html
>   Download it manually and copy it into the ec2 instance.
>   sudo rpm -ivh autoconf-archive-2012.04.07-7.3.noarch.rpm
>
>   Installing leptonica:
>   wget http://www.leptonica.com/source/leptonica-1.74.1.tar.gz
>   tar xvf leptonica-1.74.1.tar.gz
>   cd leptonica-1.74.1
>   ./configure
>   make
>   sudo make install
>   sudo ldconfig
>
>   Installing Tesseract:
>   cd ..
>   wget https://github.com/tesseract-ocr/tesseract/archive/master.zip
>   unzip master.zip
>   cd tesseract-master/
>   sudo ./autogen.sh
>   export LIBLEPT_HEADERSDIR=/usr/local/include
>   export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
>   export LD_LIBRARY_PATH=/usr/local/lib
>   ./configure --with-extra-includes=/usr/local/include --with-extra-libraries=/usr/local/lib
>   make
>   sudo make install
>   sudo ldconfig
>
>   Loading the training data for tesseract:
>   Download the tessdata and copy only the contents into
>   tesseract-master/tessdata
>   cd ..
>   sudo wget https://github.com/tesseract-ocr/tessdata/archive/master.zip
>   sudo unzip master.zip
>   Note: copy the contents into the tesseract-master/tessdata
>   export TESSDATA_PREFIX=/usr/local/share/
>   sudo mv ~/tesseract-master/tessdata/*  /usr/local/share/tessdata/
>
>   test: tesseract --version
>
>   for reference check: https://github.com/tesseract-ocr/tesseract/wiki/Compiling
>
> On Tuesday, June 27, 2017 at 1:09:48 PM UTC-7, Steven Heydendahl wrote:
>>
>> Is tesseract 3.05 available for redhat 5?  Can we just rpm it or do we
>> have to add a repository?
>>
>> On Tuesday, June 27, 2017 at 2:07:59 PM UTC-6, zdenop wrote:
>>>
>>> 2.04 is too old.
>>> Please ask them to install 3.05 + language data (at least eng and osd)
>>>
>>> Zdenko
>>>
>>> On Tue, Jun 27, 2017 at 9:58 PM, Steven Heydendahl 
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> Novice here.  I had made a request at my company to install
>>>> tesseract-ocr on our redhat 5 OS.
>>>>
>>>> They ended up installing the following:
>>>> rpm -Vp "tesseract-2.04-1.el5.rf.x86_64.rpm"
>>>>
>>>> which is apparently an older version of tesseract.  That completed
>>>> successfully; however, every time I try to run tesseract I get an error
>>>> message.  Even when I just try to do the following:
>>>> tesseract --version
>>>>
>>>> the response is:
>>>> tesseract:Error:Usage:tesseract imagename outputbase [-l lang]
>>>> [configfile [[+|-]varfile]...]
>>>>
>>>> and if I try to run tesseract on an image:
>>>> tesseract OCRTest.png text l- eng
>>>> read_variables_file:Can't open /usr/share/tesseract/tessdata/configs/eng
>>>> Unable to load unicharset file /usr/share/tesseract/tessdata/eng.unicharset
>>>>
>>>> I do not know if this was a botched install, if we are missing
>>>> dependencies, or if tesseract is just not compatible with redhat 5.  Any
>>>> help is greatly appreciated!
>>>>
>>>> Thanks,
>>>> Steve


Re: [tesseract-ocr] Simple images, trying to get the better results

2017-07-05 Thread ShreeDevi Kumar
Try with a higher dpi for output images - 300 or 600.

Also check out other psm values.
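
As a hedged illustration of the dpi suggestion: the quoted ImageMagick command writes the image at `-density 72`, so reaching an effective 300 dpi means scaling by 300/72. The sketch below only does that arithmetic and builds a `convert` argument list. The helper names and the output file name are hypothetical, and whether resampling (rather than only changing the density value) helps is something to test on the actual images.

```python
def resample_percent(current_dpi, target_dpi):
    """Percentage by which to scale an image to go from current_dpi to target_dpi."""
    if current_dpi <= 0 or target_dpi <= 0:
        raise ValueError("dpi values must be positive")
    return round(100.0 * target_dpi / current_dpi, 1)


def convert_args(src, dst, current_dpi=72, target_dpi=300):
    """Build an ImageMagick 'convert' argv list (assumed workflow, not from the thread)."""
    percent = resample_percent(current_dpi, target_dpi)
    return ["convert", src,
            "-resize", "%s%%" % percent,   # scale the pixel dimensions
            "-units", "PixelsPerInch",
            "-density", str(target_dpi),   # record the new resolution
            dst]


print(" ".join(convert_args("output.jpg", "output_300dpi.png")))
```

The same helper with `target_dpi=600` covers the other value mentioned in the reply.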

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Wed, Jul 5, 2017 at 8:13 AM, Marcos Benatti  wrote:

> Hello, I'm trying to use Tesseract OCR to recognize some letters in an
> image.
>
> I converted the image using ImageMagick and the result seems good, but it's
> not enough.
>
> The original images:
>
> [four images, not preserved in this archive]
>
> The command used with imagemagick to convert
>
> convert input.jpg -fuzz 50% -fill black -opaque black -bordercolor
> white -border 2 -fill black -draw "color 0,0 floodfill" -alpha off -negate
> -units pixelsperinch -density 72 output.jpg
>
>
> The result images:
>
> [four images, not preserved in this archive]
>
> The OCR tesseract command:
>
> $ tesseract output.jpg out -psm 7
>
> Output/result:
>
> Text: AUGU -> AUOU
>
> Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
> Page 1
>
> Text: VEGU -> VOR-OU
>
> Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
> Page 1
>
> Text: EGUV -> E6UV
>
> Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
> Page 1
>
> Text: USEA -> USSOEA
>
>


Re: [tesseract-ocr] Re: Image file not found

2017-07-04 Thread ShreeDevi Kumar
see

https://groups.google.com/forum/#!topic/tesseract-ocr/l918_ouIH98

https://groups.google.com/forum/#!topic/tesseract-ocr/hOvr20u71dY

https://groups.google.com/forum/#!topic/tesseract-ocr/nr095u8w7iU



Re: [tesseract-ocr] Re: Image file not found

2017-07-02 Thread ShreeDevi Kumar
You can browse the source code via Doxygen at
https://ub-mannheim.github.io/tesseract/a00113_source.html
For page segmentation, follow the links from there.

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sun, Jul 2, 2017 at 3:11 PM, H  wrote:

> Thanks for your reply.
>
> Do you know "where exactly" in Leptonica?
> I would like to take a look at its scripts...
>
> I have realized that Image Processing in Tesseract is limited.
> So, I would first like to see "what exactly" is done internally (by
> default) before Tesseract is called.
> Do you know where is the script or program that contains the limited Image
> Processing steps within Tesseract?
>
> Thanks,
>
>
> On Sunday, July 2, 2017 at 10:12:17 AM UTC+2, H wrote:
>>
>>
>> Through Homebrew, I have installed the Tesseract OCR engine on my Mac.
>>
>> All the directories (*jpeg, leptonica, libpng, libtiff, openssl,
>> tesseract*) are now installed in */usr/local/Cellar*
>>
>> Before putting an image in the *Cellar* directory, when I try the
>> following at the command line, obviously it fails:
>>
>> $ tesseract image.png outcome
>>
>> So, because there is no such image, I get the following messages:
>>
>>
>> Error in fopenReadStream: file not found
>>
>> Error in findFileFormat: image file not found
>>
>> Error during processing.
>>
>> Where are the programs/scripts that generate these messages? I can only
>> find *include* files in the installed Tesseract directory...
>>
>> Where are the files that contain these error messages if the image was
>> not found, etc...?
>>
>> Where are the scripts/programs that perform *image pre-processing* (such
>> as segmentation, binarization, etc...) before Tesseract actually does the
>> OCR on the image?
>>
>> Thanks,
>>


Re: [tesseract-ocr] Image file not found

2017-07-02 Thread ShreeDevi Kumar
These errors are from leptonica.

The image processing within tesseract is limited.

It is preferable to preprocess the image before calling tesseract.

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sun, Jul 2, 2017 at 1:42 PM, H  wrote:

>
> Through Homebrew, I have installed the Tesseract OCR engine on my Mac.
>
> All the directories (*jpeg, leptonica, libpng, libtiff, openssl,
> tesseract*) are now installed in */usr/local/Cellar*
>
> Before putting an image in the *Cellar* directory, when I try the
> following at the command line, obviously it fails:
>
> $ tesseract image.png outcome
>
> So, because there is no such image, I get the following messages:
>
>
> Error in fopenReadStream: file not found
>
> Error in findFileFormat: image file not found
>
> Error during processing.
>
> Where are the programs/scripts that generate these messages? I can only
> find *include* files in the installed Tesseract directory...
>
> Where are the files that contain these error messages if the image was not
> found, etc...?
>
> Where are the scripts/programs that perform *image pre-processing* (such
> as segmentation, binarization, etc...) before Tesseract actually does the
> OCR on the image?
>
> Thanks,
>


Re: [tesseract-ocr] Errors on all commandline options

2017-06-29 Thread ShreeDevi Kumar
--psm works for 3.05.01 and 4.00.00alpha.

Your output shows v3.04.01, so try -psm instead.
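
A small sketch of that rule for scripts that have to run against more than one installed version. It parses the banner line tesseract prints and picks the flag spelling. The function name `psm_flag` is hypothetical, and the 3.05.01 cut-off is taken from the statement above as an assumption; the exact release where the double-dash form first appeared is not confirmed in this thread.

```python
import re


def psm_flag(version_banner):
    """Return '--psm' or '-psm' based on a tesseract version banner.

    Assumption: the double-dash form is used from 3.05.01 onward, as stated
    in this thread; older versions get the single-dash form.
    """
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", version_banner)
    if match is None:
        raise ValueError("no version number found in: %r" % version_banner)
    major, minor = int(match.group(1)), int(match.group(2))
    patch = int(match.group(3) or 0)
    return "--psm" if (major, minor, patch) >= (3, 5, 1) else "-psm"


banner = "Tesseract Open Source OCR Engine v3.04.01 with Leptonica"
print(["tesseract", "infile.tif", "outfile", psm_flag(banner), "6"])
```

With the banner from the quoted output, the command becomes `tesseract infile.tif outfile -psm 6`.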

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Thu, Jun 29, 2017 at 8:20 PM, Brian  wrote:

> trying to run
>
> tesseract infile.tif outfile --psm 6
>
> and the output I get is
>
> Tesseract Open Source OCR Engine v3.04.01 with Leptonica
> read_params_file: Can't open 6
> Page 1
>
>
> I get that same out put for all options --oem or --psm and any number I
> specify.
>
> Any suggestions would be appreciated.
>


Re: [tesseract-ocr] Re: Tesseract library 6.0.4 queries

2017-06-27 Thread ShreeDevi Kumar
>tesseract library version 6.0.4

Tesseract-ocr stable version is 3.05.01 and development branch is for 4.0.

Are you referring to a different project that uses tesseract?

For licensing, see
https://github.com/tesseract-ocr/tesseract#license

For performance, see
https://github.com/tesseract-ocr/tesseract/issues/943



ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, Jun 27, 2017 at 11:54 AM, Chetan Lande  wrote:

> Hello,
>   Please update on this.
>
> On Saturday, 17 June 2017 20:07:08 UTC+5:30, Chetan Lande wrote:
>>
>> Hi,
>>  We are using tesseract library version 6.0.4 to scan images (cards) in
>> our project. In our tests it matches our requirements. We found that it is
>> open source, but we have a few queries.
>>
>>  1) Is there any licensing cost to use this library for commercial
>> purposes?
>>
>>  2) Processing takes a long time. Is there any way to improve the
>> performance of scanning?
>>
>>
>>  Thanks,
>>  Chetan
>>


Re: [tesseract-ocr] Re: Need help training Simplified Chinese.

2017-06-26 Thread ShreeDevi Kumar
On Tue, Jun 27, 2017 at 10:18 AM, Clement wrote:

> I downloaded the alpha source code from the link below:
> https://github.com/tesseract-ocr/tesseract/releases/tag/4.00.00alpha
>
> I installed using the following commands:
> $ ./autogen.sh
> $ ./configure PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
> LIBLEPT_HEADERSDIR=/usr/local/include LDFLAGS="-L/usr/local/lib"
> CFLAGS="-I/usr/local/include" --with-extra-includes=/usr/local/include
> --with-extra-libraries=/usr/local/lib --bindir=/usr/local/sbin
> $ sudo make install
> $ make
> $ make training
> $ sudo make training-install
>
> I also tried the dev version from Nov 24, 2016 but the behavior was the
> same:
> https://github.com/tesseract-ocr/tesseract/releases/tag/4.00.00dev
>
> Would you suggest I try again with the latest codes?
>
>
Yes, please.

A number of fixes have been applied since those tags - over 500 commits to
master branch.

So if you want to try the LSTM engine, use the latest code - follow
instructions in
https://github.com/tesseract-ocr/tesseract/wiki/Compiling-%E2%80%93-GitInstallation

get the code from
   git clone https://github.com/tesseract-ocr/tesseract.git



Re: [tesseract-ocr] ./configure failling for me

2017-06-26 Thread ShreeDevi Kumar
Also see https://github.com/tesseract-ocr/tesseract/issues/919
related to building on Centos

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, Jun 27, 2017 at 8:54 AM, ShreeDevi Kumar <shreesh...@gmail.com>
wrote:

> Have you tried:
>
> ensure that autoconf-archive is installed. Don't forget to run
> ./autogen.sh after the installation of autoconf-archive.
>
> as per
> https://github.com/tesseract-ocr/tesseract/wiki/Compiling
>
> ShreeDevi
> 
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Tue, Jun 27, 2017 at 12:14 AM, Luke Quinn <lqu...@vt.edu> wrote:
>
>> So I am trying to get tesseract installed following this guide (the git
>> guide failed at the same point as well):
>> https://medium.com/@lucas63/installing-tesseract-3-04-in-ubuntu-14-04-1dae8b748a32
>>
>> Now I get to the point where I run ./configure for tesseract, but I get
>> this error output:
>>
>> ./configure: line 4188: syntax error near unexpected token `-mavx,'
>> ./configure: line 4188: `AX_CHECK_COMPILE_FLAG(-mavx, avx=true,
>> avx=false)'
>>
>> I am running on CentOS, but I don't think that's the problem. I am not sure
>> which log files will be helpful, but if you think you can help me, post and
>> I'll be checking frequently. Note that I already looked at issue #647 and
>> tried what was suggested there, but that did not work.
>>


Re: [tesseract-ocr] ./configure failling for me

2017-06-26 Thread ShreeDevi Kumar
Have you tried:

ensure that autoconf-archive is installed. Don't forget to run
./autogen.sh after
the installation of autoconf-archive.

as per
https://github.com/tesseract-ocr/tesseract/wiki/Compiling
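
For context, the quoted error shows `AX_CHECK_COMPILE_FLAG` appearing literally inside the generated `configure` script. That macro comes from autoconf-archive, so seeing it unexpanded usually means the archive was not available when ./autogen.sh ran. The following is a minimal sketch of a check for leftover `AX_*` macros; the function name is hypothetical and the sample text is copied from the error message rather than read from a real `configure` file.

```python
import re


def unexpanded_macros(configure_text):
    """Return the sorted names of AX_* macros left unexpanded in a configure script."""
    names = re.findall(r"^\s*(AX_[A-Z0-9_]+)\(", configure_text, re.MULTILINE)
    return sorted(set(names))


# Sample line taken from the error output quoted in this thread.
sample = "AX_CHECK_COMPILE_FLAG(-mavx, avx=true, avx=false)\n"
print(unexpanded_macros(sample))
```

In practice one would pass in the contents of `configure`. An empty list after installing autoconf-archive and rerunning ./autogen.sh suggests the macros were expanded.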

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, Jun 27, 2017 at 12:14 AM, Luke Quinn  wrote:

> So I am trying to get tesseract installed following this guide (the git
> guide failed at the same point as well):
> https://medium.com/@lucas63/installing-tesseract-3-04-in-ubuntu-14-04-1dae8b748a32
>
> Now I get to the point where I run ./configure for tesseract, but I get
> this error output:
>
> ./configure: line 4188: syntax error near unexpected token `-mavx,'
> ./configure: line 4188: `AX_CHECK_COMPILE_FLAG(-mavx, avx=true, avx=false)'
>
> I am running on CentOS, but I don't think that's the problem. I am not sure
> which log files will be helpful, but if you think you can help me, post and
> I'll be checking frequently. Note that I already looked at issue #647 and
> tried what was suggested there, but that did not work.
>


Re: [tesseract-ocr] Re: Need help training Simplified Chinese.

2017-06-25 Thread ShreeDevi Kumar
>> I installed Tesseract 4.00alpha on Linux.

How did you install it?

Did you use the latest code from github?

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sun, Jun 25, 2017 at 8:18 PM, Clement  wrote:

> Thanks for your reply. I have another question related to the oem option
> you mentioned. Is it for the training command (tesstrain.sh) or the
> recognition command (tesseract)?
>
> I installed Tesseract 4.00alpha on Linux. When I ran tesseract on an
> image, I got the old format (3.x version) that's without the extra spaces
> but the recognition quality was poor. I've no other version of Tesseract
> installed on the same box.
>
> I tried to specify the "--oem 1" option but it didn't work:
> $ tesseract 001a3.png 001a3 -l chi_sim --oem 1
> read_params_file: Can't open 1
> Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
>
>
> On Wednesday, June 21, 2017 at 11:40:22 PM UTC-7, Clement wrote:
>>
>> I am new to Tesseract-OCR and need help in training the engine to
>> recognize Simplified Chinese texts.
>>
>> I just installed Tesseract 4.00Alpha on Windows 10:
>>
>> $ tesseract --version
>> tesseract 4.00.00alpha
>>  leptonica-1.74.1
>>   libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.5.0) : libpng 1.6.20 :
>> libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.3 : libopenjp2 2.1.0
>>
>> I have 3 images containing a Simplified Chinese sentence of different
>> sizes:
>>
>> chi_sim.Microsoft_Yahei.exp1.tif (small)
>> chi_sim.Microsoft_Yahei.exp2.tif (medium)
>> chi_sim.Microsoft_Yahei.exp3.tif (large)
>>
>> I ran Tesseract to recognize the texts in the images using the commands
>> below:
>>
>> $ tesseract -l chi_sim chi_sim.Microsoft_Yahei.exp1.tif
>> chi_sim.Microsoft_Yahei.exp1a
>> $ tesseract -l chi_sim chi_sim.Microsoft_Yahei.exp2.tif
>> chi_sim.Microsoft_Yahei.exp2a
>> $ tesseract -l chi_sim chi_sim.Microsoft_Yahei.exp3.tif
>> chi_sim.Microsoft_Yahei.exp3a
>>
>> Tesseract was able to recognize the texts in the large image perfectly.
>> It missed the last "period" symbol in the medium image, and failed to
>> recognize a number of characters in the small image.
>>
>> I'd like to train Tesseract to be able to recognize
>> chi_sim.Microsoft_Yahei.exp1.tif and chi_sim.Microsoft_Yahei.exp2.tif. I
>> created box files for both images as chi_sim.Microsoft_Yahei.exp1.box
>> and chi_sim.Microsoft_Yahei.exp2.box using jTessBoxEditor.
>>
>> The Windows version of Tesseract 4.0 I installed didn't come with
>> tesstrain.sh. I downloaded the source and was able to extract the training
>> commands. The documentation mentions LSTM, but I couldn't find any
>> LSTM call within the tesstrain.sh script. Anyway, I ran the extracted
>> commands as below ($TESS_LANG is the path of the langdata folder.):
>>
>> = Phase I: Generating training images =
>> $ unicharset_extractor -D ./chi_sim chi_sim.Microsoft_Yahei.exp1.box
>> chi_sim.Microsoft_Yahei.exp2.box
>>
>> = Phase UP: Generating unicharset and unichar properties files =
>> $ set_unicharset_properties -U ./chi_sim/unicharset -O
>> ./chi_sim/chi_sim.unicharset -X ./chi_sim/chi_sim.xheights
>> --script_dir=$TESS_LANG
>>
>> = Phase D: Generating Dawg files =
>> $ wordlist2dawg -r 1 $TESS_LANG/chi_sim/chi_sim.wordlist
>> ./chi_sim/chi_sim.word-dawg ./chi_sim/chi_sim.unicharset
>>
>> = Phase E: Extracting features =
>>
>> $ tesseract chi_sim.Microsoft_Yahei.exp2.tif
>> chi_sim.Microsoft_Yahei.exp2 box.train $TESS_LANG/chi_sim/chi_sim.config
>> $ tesseract chi_sim.Microsoft_Yahei.exp1.tif
>> chi_sim.Microsoft_Yahei.exp1 box.train $TESS_LANG/chi_sim/chi_sim.config
>>
>> = Phase C: Clustering feature prototypes (cnTraining) =
>> $ cntraining -D ./chi_sim chi_sim.Microsoft_Yahei.exp1.tr
>> chi_sim.Microsoft_Yahei.exp2.tr
>>
>> = Phase M : Clustering microfeatures (mfTraining) =
>> $ mftraining -D ./chi_sim/ -U ./chi_sim/chi_sim.unicharset -O
>> ./chi_sim/chi_sim.mfunicharset -F $TESS_LANG/font_properties -X
>> ./chi_sim/chi_sim.xheights chi_sim.Microsoft_Yahei.exp1.tr
>> chi_sim.Microsoft_Yahei.exp2.tr
>>
>> = Making final traineddata file =
>> $ cp $TESS_LANG/chi_sim/chi_sim.config ./chi_sim/.
>>
>> Add "chi_sim." to files "inttemp", "normproto", "pffmtable", and
>> "shapetable"
>>
>> $ combine_tessdata ./chi_sim/chi_sim.
>>
>> $ cp ./chi_sim/chi_sim.traineddata
>> $TESSDATA_PREFIX/tessdata/chi_sim_1.traineddata
>>
>> ===
>>
>> I reran Tesseract on the 3 images using the commands below:
>>
>> $  tesseract -l chi_sim_1+chi_sim chi_sim.Microsoft_Yahei.exp1.tif
>> chi_sim.Microsoft_Yahei.exp1b
>>
>> $  tesseract -l chi_sim_1+chi_sim chi_sim.Microsoft_Yahei.exp2.tif
>> chi_sim.Microsoft_Yahei.exp2b
>>
>> $  tesseract -l chi_sim_1+chi_sim chi_sim.Microsoft_Yahei.exp3.tif
>> chi_sim.Microsoft_Yahei.exp3b
>>
>> The large image still produces perfect result. The medium image gives the
>> same result as before missing a "period" symbol. The small 

Re: [tesseract-ocr] Trainer GUI for Tesseract version 4.0

2017-06-24 Thread ShreeDevi Kumar
Take a look at
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00 for
an overview of training for 4.0. Follow the tutorials to get a feel for the
training process; you can try it for English as well as Malayalam.

As for a trainer GUI, I think it would probably work for `fine tune`
training.

One area where you could contribute to 4.0 training is creating box
files in 4.0 format from scanned images.

Also look at jTessBoxEditor, which offers a Tesseract training GUI, though
not for 4.0.

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sat, Jun 24, 2017 at 7:25 PM, Nalin Linux 
wrote:

>
>
> On Saturday, June 24, 2017 at 7:07:32 PM UTC+5:30, shree wrote:
>>
>> You can update it for 3.05.01
>>
> I am quite impressed with Tesseract 4.0, and it is working fine for my
> language (Malayalam). Was the trained data for version 4.0 listed at
> https://github.com/tesseract-ocr/tessdata
> created from the old language data itself
> (https://github.com/tesseract-ocr/langdata)? What about creating a
> training GUI for version 4.0? I have two months at my disposal for
> developing such a GUI. Please let me know whether this project is relevant;
> otherwise I will switch to another relevant free and open-source project.
>
> Thanking you Nalin.
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduWLr%2Be8cT%2B%2BWerrz0%2BS1FH1TFHAJnePLJk17LrdrbULgA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Re: [tesseract-ocr] Trainer GUI for Tesseract version 4.0

2017-06-24 Thread ShreeDevi Kumar
You can update it for 3.05.01

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sat, Jun 24, 2017 at 6:59 PM, Nalin Linux 
wrote:

> I was developing a Tesseract trainer GUI which makes Tesseract training
> easier for end users and research scholars.
> It was working for version 3.04. Now I am concerned about the relevance of
> this trainer GUI for Tesseract version 4.0.
>
> Please watch the following video, which shows my trainer GUI for version
> 3.04:
> https://www.youtube.com/watch?v=qLpCld4cdtk#t=3m25s
>
> Please let me know whether this trainer needs to be upgraded for Tesseract
> 4.0.
> Any suggestions are welcome.
>
> Tesseract Trainer GUI GitHub page: https://github.com/Nalin-x-Linux/lios-3
>


Re: [tesseract-ocr] I am looking for the best way to OCR scan sports scoreboards (such as stadium scoreboards) for such items as time and scores

2017-06-23 Thread ShreeDevi Kumar
Take a look at https://www.unix-ag.uni-kl.de/~auerswal/ssocr/

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Thu, Jun 22, 2017 at 11:58 PM,  wrote:

> I am experimenting with Tesseract, which does not do well, but maybe I can
> train it. Any hints on whether this is possible, or on a better way of
> getting times and scores from sports scoreboards?
>
> scoreboards similar to
>
> https://www.google.com/search?biw=1440=771=1=
> isch=1=sport+scoreboard=sport+scoreboard_l=img.
> 3..0j0i8i30k1.5714.5714.0.5918.1.1.0.0.0.0.69.69.1.1.0..
> ..0...1.1.64.img..0.1.68.QYk-CQtiVTQ#imgrc=_2nu8Vy18KK8jM:
>
> thank you
> Frank
>


Re: [tesseract-ocr] Fine Tuning Iterations

2017-06-22 Thread ShreeDevi Kumar
> what is the number of the iterations that will for sure cover the 40 lstmf
> files?

It depends on the number of lines in each file. For example, if each file
has 1000 lines, then 40,000 iterations should cover all files once.

You can use --target_error_rate 0.01 instead of a number of iterations as
a guide for how long to train.
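
As a rough sketch of that arithmetic (assuming one iteration consumes one
line and every file holds the same number of lines; the function name is
made up for illustration):

```shell
# Hypothetical helper: iterations needed for one pass over the training data.
# $1 = number of lstmf files, $2 = lines per file (assumed equal for all files).
iterations_for_one_pass() {
  echo $(( $1 * $2 ))
}

# 40 files of 1000 lines each
iterations_for_one_pass 40 1000
```

The printed value could then be passed as --max_iterations. If the files
differ in size, sum the line counts instead of multiplying.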



ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Thu, Jun 22, 2017 at 3:15 PM, Ibr  wrote:

> Hi,
>
> if I want to run the command:
>
> training/lstmtraining --model_output ~/tesstutorial/full_japanese/new \
>   --continue_from ~/tesstutorial/extracted_lstm/jpn.lstm \
>   --train_listfile ~/tesstutorial/jpntrain/jpn.training_files.txt \
>   --max_iterations 10
>
> How can I set --max_iterations so that all lstmf files listed in
> training_files.txt are trained against? I mean, if I have 40 lstmf
> files in training_files.txt, what number of iterations will be sure
> to cover all 40 lstmf files?
>
> Also, if I have trained against one set of lstmf files and then get a new
> set, I can continue training against the new set without repeating the
> first set, correct? And if so, all I have to do is change the paths in
> training_files.txt to the new set of lstmf files while keeping
> --continue_from as it is, correct?
>
> thanks
>


Re: [tesseract-ocr] Need help training Simplified Chinese.

2017-06-22 Thread ShreeDevi Kumar
Your best bet for improving recognition is to preprocess the small and
medium images by scaling them up to a larger size.
Please see https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality
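
One possible way to do the enlargement, as a sketch only: it assumes
ImageMagick's `convert` is installed, and the 300% factor and the output
file name are illustrative rather than recommended values. The helper
prints the command instead of running it, so it can be reviewed first.

```shell
# Hypothetical helper: print an ImageMagick command that enlarges an image.
# $1 = input image, $2 = scale in percent, $3 = output image.
upscale_cmd() {
  printf 'convert "%s" -resize %s%% "%s"\n' "$1" "$2" "$3"
}

upscale_cmd chi_sim.Microsoft_Yahei.exp1.tif 300 exp1_large.tif
```

Run the printed command, then run tesseract on the enlarged file and
compare the result with the original output.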

Tesseract 4.00.00alpha currently has two different OCR engines in it. The
legacy Tesseract engine is accessible with --oem 0 and the new LSTM engine
is accessible with --oem 1.
The option --oem 2 uses both together, and --oem 3 uses whichever one
has been defined as the default.
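
To compare the engines side by side, something like the following sketch
may help. It assumes a build that accepts --oem; the image name is the one
from your earlier command, the output names are made up, and
oem_description is a hypothetical helper, not part of Tesseract.

```shell
# Map an --oem value to the engine it selects, per the description above.
oem_description() {
  case "$1" in
    0) echo "legacy engine" ;;
    1) echo "LSTM engine" ;;
    2) echo "legacy and LSTM engines together" ;;
    3) echo "whichever engine is the default" ;;
    *) echo "unknown --oem value: $1" >&2; return 1 ;;
  esac
}

# Print (not run) one command per mode so the outputs can be compared.
for mode in 0 1 2 3; do
  echo "tesseract 001a3.png 001a3_oem$mode -l chi_sim --oem $mode  # $(oem_description "$mode")"
done
```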

The training process that you followed builds a new model for the legacy
engine, not LSTM.

If you look at the output of your first test, you will notice that there
are spaces after each character in the OCRed text, which has been reported
as an issue with the LSTM model. The legacy model does not add the extra
spaces, but its accuracy is lower.

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Thu, Jun 22, 2017 at 11:15 AM, Clement  wrote:

> I am new to Tesseract-OCR and need help in training the engine to
> recognize Simplified Chinese texts.
>
> I just installed Tesseract 4.00Alpha on Windows 10:
>
> $ tesseract --version
> tesseract 4.00.00alpha
>  leptonica-1.74.1
>   libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.5.0) : libpng 1.6.20 :
> libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.3 : libopenjp2 2.1.0
>
> I have 3 images containing a Simplified Chinese sentence of different
> sizes:
>
> chi_sim.Microsoft_Yahei.exp1.tif (small)
> chi_sim.Microsoft_Yahei.exp2.tif (medium)
> chi_sim.Microsoft_Yahei.exp3.tif (large)
>
> I ran Tesseract to recognize the texts in the images using the commands
> below:
>
> $ tesseract -l chi_sim chi_sim.Microsoft_Yahei.exp1.tif
> chi_sim.Microsoft_Yahei.exp1a
> $ tesseract -l chi_sim chi_sim.Microsoft_Yahei.exp2.tif
> chi_sim.Microsoft_Yahei.exp2a
> $ tesseract -l chi_sim chi_sim.Microsoft_Yahei.exp3.tif
> chi_sim.Microsoft_Yahei.exp3a
>
> Tesseract was able to recognize the texts in the large image perfectly. It
> missed the last "period" symbol in the medium image, and failed to
> recognize a number of characters in the small image.
>
> I'd like to train Tesseract to be able to recognize
> chi_sim.Microsoft_Yahei.exp1.tif and chi_sim.Microsoft_Yahei.exp2.tif. I
> created box files for both images as chi_sim.Microsoft_Yahei.exp1.box and
> chi_sim.Microsoft_Yahei.exp2.box using jTessBoxEditor.
>
> The Windows version of Tesseract 4.0 I installed didn't come with
> tesstrain.sh. I downloaded the source and was able to extract the training
> commands. The documentation mentions LSTM, but I couldn't find any
> LSTM call within the tesstrain.sh script. Anyway, I ran the extracted
> commands as below ($TESS_LANG is the path of the langdata folder.):
>
> = Phase I: Generating training images =
> $ unicharset_extractor -D ./chi_sim chi_sim.Microsoft_Yahei.exp1.box
> chi_sim.Microsoft_Yahei.exp2.box
>
> = Phase UP: Generating unicharset and unichar properties files =
> $ set_unicharset_properties -U ./chi_sim/unicharset -O
> ./chi_sim/chi_sim.unicharset -X ./chi_sim/chi_sim.xheights
> --script_dir=$TESS_LANG
>
> = Phase D: Generating Dawg files =
> $ wordlist2dawg -r 1 $TESS_LANG/chi_sim/chi_sim.wordlist
> ./chi_sim/chi_sim.word-dawg ./chi_sim/chi_sim.unicharset
>
> = Phase E: Extracting features =
>
> $ tesseract chi_sim.Microsoft_Yahei.exp2.tif chi_sim.Microsoft_Yahei.exp2
> box.train $TESS_LANG/chi_sim/chi_sim.config
> $ tesseract chi_sim.Microsoft_Yahei.exp1.tif chi_sim.Microsoft_Yahei.exp1
> box.train $TESS_LANG/chi_sim/chi_sim.config
>
> = Phase C: Clustering feature prototypes (cnTraining) =
> $ cntraining -D ./chi_sim chi_sim.Microsoft_Yahei.exp1.tr
> chi_sim.Microsoft_Yahei.exp2.tr
>
> = Phase M : Clustering microfeatures (mfTraining) =
> $ mftraining -D ./chi_sim/ -U ./chi_sim/chi_sim.unicharset -O
> ./chi_sim/chi_sim.mfunicharset -F $TESS_LANG/font_properties -X
> ./chi_sim/chi_sim.xheights chi_sim.Microsoft_Yahei.exp1.tr
> chi_sim.Microsoft_Yahei.exp2.tr
>
> = Making final traineddata file =
> $ cp $TESS_LANG/chi_sim/chi_sim.config ./chi_sim/.
>
> Add "chi_sim." to files "inttemp", "normproto", "pffmtable", and
> "shapetable"
>
> $ combine_tessdata ./chi_sim/chi_sim.
>
> $ cp ./chi_sim/chi_sim.traineddata
> $TESSDATA_PREFIX/tessdata/chi_sim_1.traineddata
>
> ===
>
> I reran Tesseract on the 3 images using the commands below:
>
> $  tesseract -l chi_sim_1+chi_sim chi_sim.Microsoft_Yahei.exp1.tif
> chi_sim.Microsoft_Yahei.exp1b
>
> $  tesseract -l chi_sim_1+chi_sim chi_sim.Microsoft_Yahei.exp2.tif
> chi_sim.Microsoft_Yahei.exp2b
>
> $  tesseract -l chi_sim_1+chi_sim chi_sim.Microsoft_Yahei.exp3.tif
> chi_sim.Microsoft_Yahei.exp3b
>
> The large image still produces perfect result. The medium image gives the
> same result as before missing a "period" symbol. The small image actually
> returns worse result detecting wrong number of words from the image.
>
> I am attaching a zip files 

Re: [tesseract-ocr] Re: unicharset_extractor extracting zero values

2017-06-20 Thread ShreeDevi Kumar
The master branch currently includes the legacy engine, so you can easily
build your custom traineddata using the following command (modify it for
your fonts location, training text, font name, etc.):


training/tesstrain.sh \
  --fonts_dir ~/.fonts \
  --tessdata_dir ../tessdata \
  --training_text ../langdata/eng/eng.training_text \
  --langdata_dir ../langdata \
  --lang eng \
  --exposures "0" \
  --fontlist "Supercell Magic" \
  --output_dir ~/tesstutorial/engtest

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, Jun 20, 2017 at 7:45 PM, David Barishev  wrote:

> After several tests, I have found mixed results.
>
> If I download Leptonica 1.74.4, build it, and then build the master branch,
> it works fine.
> With the same version of Leptonica, the 3.05.01 release fails with the
> following error:
>
>
> libtool: link: g++ -g -O2 -std=c++11 -o .libs/tesseract
> tesseract-tesseractmain.o  ./.libs/libtesseract.so -lrt -lpthread
> /usr/bin/ld: tesseract-tesseractmain.o: undefined reference to symbol
> 'lept_free'
> /usr/local/lib//liblept.so.5: error adding symbols: DSO missing from
> command line
> collect2: error: ld returned 1 exit status
> Makefile:598: recipe for target 'tesseract' failed
> make[2]: *** [tesseract] Error 1
> make[2]: Leaving directory '/home/david/project/tesseract-3.05.01/api'
> Makefile:489: recipe for target 'all-recursive' failed
> make[1]: *** [all-recursive] Error 1
> make[1]: Leaving directory '/home/david/project/tesseract-3.05.01'
> Makefile:398: recipe for target 'all' failed
> make: *** [all] Error 2
>
>
> The docs state the *minimum* version of Leptonica needed to build
> Tesseract, so the latest Leptonica should work even with Tesseract 3.05.01.
>
> Can you please try to build version 3.05.01?
>
>
> On Tuesday, June 20, 2017 at 11:03:25 AM UTC+3, shree wrote:
>>
>> > Do you know why my tesseract isnt compiling ? I would really love a
>> updated version on my ubuntu.
>>
>> Not sure. I haven't built 3.05 branch. For master, I follow the usual
>> autotools method.
>>
>> Have you also built leptonica? Make sure you don't have any old
>> leptonica version already.
>>
>> Make sure you use either autotools for both or cmake for both tesseract
>> and leptonica. Use the latest sources for both from github.
>>
>>
>>
>>
>>
>> ShreeDevi
>> 
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Tue, Jun 20, 2017 at 1:20 PM, David Barishev 
>> wrote:
>>
>>> Thank you so much for your help, i found my error, i need to set script
>>> dir to the langdata folder when runnning set_unicharset_properties.
>>> Do you know why my tesseract isnt compiling ? I would really love a
>>> updated version on my ubuntu.
>>>
>>> Thank you again.
>>>
>>>
>>> On Tuesday, June 20, 2017 at 6:59:58 AM UTC+3, shree wrote:


>>>> See https://github.com/tesseract-ocr/tesseract/issues/318
>>>> regarding the unicharset format
>>>>
>>>> I was able to do regular tesseract training (not lstm) using tesseract
>>>> 4.00.00 version from github master and create new unicharset and
>>>> traineddata with your box/tiff pair. The output on the same tiff file is
>>>> enclosed.
>>>>
>>>> I think you will get better results with the training input text having
>>>> interword spaces.


Re: [tesseract-ocr] bad result on tesseract(4.0) with lstm

2017-06-20 Thread ShreeDevi Kumar
Your input image quality needs to be improved.

Also test with --oem 1 alone.

Please test with
https://github.com/tesseract-ocr/tesseract/blob/master/testing/hebtypo.jpg
and see if you get similar results.

For hOCR output, just adding hocr to the command line should work, as long
as you have the hocr config file in your tessdata directory.
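
A sketch of what that looks like for the command from your message. The
helper name is made up, and it only prints the command; the config name
hocr goes at the end, after the other options.

```shell
# Hypothetical helper: print the hOCR variant of a tesseract command.
# $1 = image, $2 = output base name, $3 = language code.
hocr_cmd() {
  printf 'tesseract %s %s -l %s --psm 6 hocr\n' "$1" "$2" "$3"
}

hocr_cmd file.tiff output heb
```

With the output base name output, the result should be written next to
the image as output.hocr.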

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, Jun 20, 2017 at 1:05 PM, לאה למד  wrote:

>
> hi
> * Attached line from the original image
>
>  command  *tesseract file.tiff output --oem 2 -l heb --psm 6*
> resulte *"אומדן / שווי ההתקשרות: 6 ₪ לפני מע"מ. ₪"*
>
>  command  *tesseract file.tiff output --oem 0 -l heb --psm 6*
> resulte *"אןמדן ושווי ההתקשרות: 16,656 ₪ לפניימע"מ. ₪”"*
>
> For people who don't read Hebrew: the sentence is extracted better with
> LSTM, but for some unknown reason the extracted number is completely wrong.
> Any ideas?
>
> An unrelated question: how can I produce "hocr" output in the new Tesseract?
> Thank you
>


Re: [tesseract-ocr] Re: unicharset_extractor extracting zero values

2017-06-20 Thread ShreeDevi Kumar
> Do you know why my Tesseract isn't compiling? I would really love an
> updated version on my Ubuntu.

Not sure. I haven't built the 3.05 branch. For master, I follow the usual
autotools method.

Have you also built Leptonica? Make sure you don't already have an old
Leptonica version installed.

Make sure you use either autotools for both or cmake for both Tesseract and
Leptonica. Use the latest sources for both from GitHub.
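
One way to look for a leftover Leptonica copy, as a sketch: the two
directories searched by default here are assumptions about a typical Linux
layout, and find_liblept is a made-up helper name.

```shell
# List Leptonica shared libraries under the given directories so that
# stale copies (for example an older liblept.so.N) can be spotted.
find_liblept() {
  find "$@" -name 'liblept*' 2>/dev/null | sort
}

find_liblept /usr/lib /usr/local/lib
```

If the same library shows up in more than one place with different
version suffixes, the linker may be picking up the older one.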





ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, Jun 20, 2017 at 1:20 PM, David Barishev  wrote:

> Thank you so much for your help. I found my error: I need to set the script
> dir to the langdata folder when running set_unicharset_properties.
> Do you know why my Tesseract isn't compiling? I would really love an
> updated version on my Ubuntu.
>
> Thank you again.
>
>
> On Tuesday, June 20, 2017 at 6:59:58 AM UTC+3, shree wrote:
>>
>>
>> See https://github.com/tesseract-ocr/tesseract/issues/318
>> regarding the unicharset format
>>
>> I was able to do regular tesseract training (not lstm) using tesseract
>> 4.00.00 version from github master and create new unicharset and
>> traineddata with your box/tiff pair. The output on the same tiff file is
>> enclosed.
>>
>> I think you will get better results with the training input text having
>> interword spaces.
>>


Re: [tesseract-ocr] Tesseract 4.00.00alpha Windows doesn't find image files

2017-06-20 Thread ShreeDevi Kumar
Please show the command line you used followed by the error.

You may have to put the filename in quotes if there are spaces in it.
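
To illustrate why the quotes matter, here is a small sketch in a
Unix-style shell; the file name is made up. The Windows command prompt
splits arguments on spaces in the same way, and double quotes group them
there too.

```shell
# Print how many separate arguments a command would receive.
count_args() { echo "$#"; }

count_args my scan.png      # unquoted: seen as two arguments
count_args "my scan.png"    # quoted: seen as one file name
```

Without the quotes, tesseract would treat the first word as the image
name and fail to find it.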

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Mon, Jun 19, 2017 at 9:32 PM, J. Karjalainen  wrote:

> Hi!
> I need your help please!
>
> I'm trying to run Tesseract 4.00.00alpha (used the installer) on Win7 SP1
> 32-bit, and it doesn't find any files, even if they are in the same folder
> as tesseract.exe.
>
> I always get:
>
> Error in *fopenReadStream: file not found*
>
>
> C:\Program Files\Tesseract-OCR>tesseract --version
> tesseract 4.00.00alpha
>  leptonica-1.74.1
>   libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.5.0) : libpng 1.6.20 :
> libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.3 : libopenjp2 2.1.0
>
> I also have C:\Program Files\Tesseract-OCR in my path.
>
> Any help would be appreciated! :)
>


Re: [tesseract-ocr] How to improve the recognition of receipt (text not in words dictionary)

2017-06-20 Thread ShreeDevi Kumar
Please see
https://github.com/tesseract-ocr/tesseract/issues/960#issuecomment-305966719

On stable 3.0x, you can try adding your product catalog to the eng.user-words
file and check for improvement.

In my unit test, it seemed to apply the words from the user dictionary.
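
A sketch of preparing such a word list, assuming the catalog is plain
text and that one word per line is wanted; the sample product names and
the helper name are made up. How the file gets picked up depends on your
Tesseract version and configuration, so see the linked issue for that part.

```shell
# Hypothetical helper: read catalog text on stdin and print one unique
# word per line, sorted bytewise.
make_user_words() {
  tr -s '[:space:]' '\n' | sed '/^$/d' | LC_ALL=C sort -u
}

# Example with made-up catalog entries; redirect to eng.user-words to keep it.
printf 'Espresso Beans\nOat Milk\nEspresso Cups\n' | make_user_words
```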

Alternatively, you can also try the development version, Tesseract 4, with
--oem 1 directly. I don't think user-words work with it, but it might give
you better recognition.

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Mon, Jun 19, 2017 at 8:31 PM, Laura  wrote:

> Hi, I’m new to Tesseract. I’m trying to recognize receipts. Since much of
> the text on receipts is not dictionary words, I disabled the
> dictionaries. This increased the recognition rate, but it’s still low. I’d
> like to create my own dictionary from the product catalog.
>
> Can someone point me to a tutorial for doing this?
>
> Many thanks !
>
> Laura
>


[tesseract-ocr] error building 3.05.01

2017-06-19 Thread ShreeDevi Kumar
Sorry, I haven't built 3.05.01.

Hope others can help.


On Tue, Jun 20, 2017 at 2:32 AM, David Barishev  wrote:

> Hey, I am trying to build Tesseract from source now, and after building
> Leptonica, I couldn't build Tesseract; it fails with this error:
>
> /bin/bash ../libtool  --tag=CXX   --mode=link g++  -g -O2 -std=c++11   -o
> tesseract tesseract-tesseractmain.o libtesseract.la  -lrt -lpthread
> libtool: link: g++ -g -O2 -std=c++11 -o .libs/tesseract
> tesseract-tesseractmain.o  ./.libs/libtesseract.so -lrt -lpthread
> /usr/bin/ld: tesseract-tesseractmain.o: undefined reference to symbol
> 'lept_free'
> //usr/local/lib/liblept.so.5: error adding symbols: DSO missing from
> command line
> collect2: error: ld returned 1 exit status
> Makefile:598: recipe for target 'tesseract' failed
> make[2]: *** [tesseract] Error 1
> make[2]: Leaving directory '/home/david/project/tesseract-3.05.01/api'
> Makefile:489: recipe for target 'all-recursive' failed
> make[1]: *** [all-recursive] Error 1
> make[1]: Leaving directory '/home/david/project/tesseract-3.05.01'
> Makefile:398: recipe for target 'all' failed
> make: *** [all] Error 2
>
>
> Any idea why ?
>
>



Re: [tesseract-ocr] unicharset_extractor extracting zero values

2017-06-19 Thread ShreeDevi Kumar
I would also suggest that you add spaces between words in your input text.

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


Re: [tesseract-ocr] unicharset_extractor extracting zero values

2017-06-19 Thread ShreeDevi Kumar
You could also try running the training on your Windows PC with 3.05.01 using
tesstrain.sh under `git for windows`, which will provide you with a shell for
running bash scripts.

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


Re: [tesseract-ocr] unicharset_extractor extracting zero values

2017-06-19 Thread ShreeDevi Kumar
Where do you keep your source files for the English langdata?

If they are in a directory such as ../langdata/eng/,
then put common.unicharset, latin.unicharset, font_properties, etc. in
../langdata



ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Mon, Jun 19, 2017 at 8:34 PM, David Barishev  wrote:

> Thanks for the reply.
> If you mean whether I have the latin and common unicharset files in the
> tessdata directory (/usr/share/tesseract-ocr/tessdata): I have downloaded
> them and placed them there, and I still get the same behavior.
> I have also tried it on my Windows machine, which has version 3.05,
> and saw the same behavior.

Re: [tesseract-ocr] unicharset_extractor extracting zero values

2017-06-19 Thread ShreeDevi Kumar
Do you have the common and latin unicharset files in your langdata directory?

See https://github.com/tesseract-ocr/langdata

Try to build the latest 3.05.01 version.

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Mon, Jun 19, 2017 at 3:23 PM, David Barishev  wrote:

> Hello all!
> I'm trying to train tesseract to recognize a new font in English
> (supercell-magic).
> I have created a .tif file and a matching .box file using jTessBoxEditor
> (eng.supercell-magic.exp0.tif and eng.supercell-magic.exp0.box), and
> trained tesseract with them.
>
> Here is tesseract's output:
> $ tesseract eng.supercell-magic.exp0.tif eng.supercell-magic.exp0 box.train
> Tesseract Open Source OCR Engine v3.04.01 with Leptonica
> Page 1
> row xheight=30, but median xheight = 37.5455
> APPLY_BOXES:
>    Boxes read from boxfile:1559
>    Found 1559 good blobs.
> Generated training data for 34 words
> Page 2
> APPLY_BOXES:
>    Boxes read from boxfile:1677
>    Found 1677 good blobs.
> Generated training data for 34 words
> Page 3
> APPLY_BOXES:
>    Boxes read from boxfile:1362
>    Found 1362 good blobs.
> Generated training data for 28 words
>
>
> The next step is to extract the characters using unicharset_extractor.
> Its output looked normal:
> $ unicharset_extractor eng.supercell-magic.exp0.box
> Extracting unicharset from eng.supercell-magic.exp0.box
> Wrote unicharset file ./unicharset.
>
> But when I view the file, it's mostly 0 and 255, unlike the example in the
> wiki
> <https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract#an-example-of-the-unicharset-file>:
>
> An example of the unicharset file
>
> 110
> NULL 0 NULL 0
> N 5 59,68,216,255,87,236,0,27,104,227 Latin 11 0 1 N
> Y 5 59,68,216,255,91,205,0,47,91,223 Latin 33 0 2 Y
> 1 8 59,69,203,255,45,128,0,66,74,173 Common 3 2 3 1
> 9 8 18,66,203,255,89,156,0,39,104,173 Common 4 2 4 9
> a 3 58,65,186,198,85,164,0,26,97,185 Latin 56 0 5 a
> ...
>
>
> Mine looks more like this:
>
> 74
> NULL 0 NULL 0
> Joined 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0   # Joined [4a 6f 69 6e 65 64 ]
> |Broken|0|1 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # Broken
> t 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # t [74 ]
> h 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # h [68 ]
> a 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # a [61 ]
> n 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # n [6e ]
> P 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # P [50 ]
> o 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # o [6f ]
> e 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # e [65 ]
> : 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # : [3a ]
> r 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # r [72 ]
> l 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # l [6c ]
> i 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # i [69 ]
> 1 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # 1 [31 ]
> N 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # N [4e ]
>
> Why is that? Thanks in advance.
>
> I'm using Ubuntu 16.04 with this tesseract version:
>
> tesseract 3.04.01
>  leptonica-1.73
>   libgif 5.1.2 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 
> 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.1.0
>
>  I have attached the box and tiff file and the data file, and the unicharset 
> file.
>
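
To make the difference between the two unicharset listings concrete, here is a
small Python sketch that counts how many entries still carry the placeholder
values shown in David's file (property flags 0, metrics 0,255,0,255,0,0,0,0,0,0
and script NULL). The function name summarize_unicharset is hypothetical, and
the field layout is inferred only from the two samples quoted in this thread.
As far as I know, the 3.x training tools fill these fields in a separate
set_unicharset_properties step that reads the script unicharset files from
langdata, which would fit the question about the common and latin unicharset
files; please treat that as an assumption to check against the training wiki.

```python
DEFAULT_METRICS = "0,255,0,255,0,0,0,0,0,0"


def summarize_unicharset(text: str) -> dict:
    """Count entries and how many of them still carry placeholder properties."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    declared = int(lines[0].split()[0])  # first line holds the number of entries
    parsed = 0
    placeholder = 0
    for line in lines[1:]:
        fields = line.split()
        if len(fields) < 4 or fields[0] == "NULL":
            continue  # the special "NULL 0 NULL 0" entry has no metrics field
        parsed += 1
        # Fields used here: unichar, property flags, metrics, script.
        if fields[1] == "0" and fields[2] == DEFAULT_METRICS and fields[3] == "NULL":
            placeholder += 1
    return {"declared": declared, "parsed": parsed, "placeholder": placeholder}


sample = """\
74
NULL 0 NULL 0
Joined 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # Joined [4a 6f 69 6e 65 64 ]
t 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # t [74 ]
N 5 59,68,216,255,87,236,0,27,104,227 Latin 11 0 1 N
"""
print(summarize_unicharset(sample))
# prints {'declared': 74, 'parsed': 3, 'placeholder': 2}
```

If every parsed entry is a placeholder, as in David's file, the extractor has
only listed the characters and nothing has filled in their properties yet.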


Re: [tesseract-ocr] Newbie: Trying to scan IBM Selectric "script" typeface

2017-06-16 Thread ShreeDevi Kumar
Glad it worked for you.

The 4.0 LSTM version is still under active development.

I am curious to know whether you 'cloned' the repository for the latest version
or used the source from https://github.com/tesseract-ocr/tesseract/releases.

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sat, Jun 17, 2017 at 3:01 AM, Carl S. Gutekunst 
wrote:

>
>
> On Friday, June 16, 2017 at 3:50:24 AM UTC-7, shree wrote:
>>
>> Which version of tesseract are you using?
>>
>>
> Problem resolved. I was running 3.03, which is what came with my distro.
> The output looked like:
>
> The éanmen and hit wéfie, Jacob and Many thbont, had a beautéfiut We
> daughten, Atma. When I became ofi age to noam anound the yand, Atma would
> come
> out 06 the houte and ptay with me. She uted to chate me ate anound the
> yand.
>
> vs my new build v4.00.00alpha:
>
> The and his wife, Jacob and Mary Gibbons, had a beautiful Little
> daughter, Alma. When I became of age to roam around the yard, Alma would
> come
> out of the house and play with me. She used to chase me all around the
> yard.
>
> Now I feel like a doofus; one should always make sure they're running the
> latest code before asking questions on public forums. That said, it took a
> couple of hours to get everything patched/installed/upgraded, so it was
> very reassuring seeing your example.
>
> Thank you!
>


Re: [tesseract-ocr] Newbie: Trying to scan IBM Selectric "script" typeface

2017-06-16 Thread ShreeDevi Kumar
Which version of tesseract are you using?

Using the latest code from github with eng.traineddata I get the following:

tesseract AutoRedCow_01.png stdout --psm 3 --oem 1 -l eng
Hi! My name is Cow,. Not just any kind of cow, but
Cow spelled with a capital 'C'. No, 1 never had any
other name - Zike Nelly on Daisy, Zike most cows.
Sometimes they would call me "Per Pastor Sine Kauh" -
{or those who don't understand the New Fane Zingo,
that's "The Pastor's Cow." But somewhere along the
way I got to be known as the Red Cow. Why aed I



could never understand. The chicken's combs were
red. They said Helen's hair was red. But my fur? 1 had some white patches,
mome brown or maroon patches, but red? Or am 1 color blind?

Be that as it may, over the years there have been so many stories told
about me and what I1 consider my rather dull and uneventful Life, that 1
have
decided to have my autobiography written just to set the record straight.
The
exaggerations I've heard by those Gutekunst siblings have given me something
to bee{ about. It's about time 1 blow my own horns before they milk all
those
atomries day.

1 uas bon on a {farm near Beechwood, Wisconsin, This is up the road a piece,
maybe about four miles, from New Fane. My mother was a respected member of
the
herd and my father pointed with pride at his many and beautiful of{spiing,
not
to mention his Lovely harem. _

The {armer and his wife, Jacob and Mary Gibbons, had a beautiful Little
daughter, Afma. When I became of age to roam around the yard, Alma would
come
out of the house and play with me. She used to chase me atl around the yard.
Of course, I acted as if I was afraid of her and would run just fast enough
that she couldn't catch me. Sometimes ashe would even Zet me come into the
house, when her mother wasn't Looking. This went on for quite some time
until
one day 1, by accident, Zeft a few calf chips behind on the kitchen fLZoor.
What
happened to me and what she said 1 just can't repeat in polite company.

_ Another time she chased me {first, Then 1 decided to turn tables and chase
her. Ama turned around and started to sun but didn't realize that there was
a
trough filled with water right behind her. When she turned around she feAZl
head {irst into the trough.


ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Fri, Jun 16, 2017 at 12:44 PM, Carl S. Gutekunst 
wrote:

> I am trying to OCR a document that was typed using an IBM Selectric
> "Script" font ball. The result is very poor; I'm inferring tesseract
> doesn't know about this font. Is the font something that can be tweaked
> from a config var? Do I need to train tesseract for this font?
>
> Sorry for the utter newbie question. A pointer to the right place to start
> would be much appreciated.
>


Re: [tesseract-ocr] large char set language training

2017-06-16 Thread ShreeDevi Kumar
Yes, there is a method for rendering synthetic training data from a
training_text file and fonts: the text2image program and the tesstrain.sh
script.

https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract

https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-%E2%80%93-tesstrain.sh

Which version of tesseract are you using?

I would suggest that you try the latest version built from github with the
Chinese traineddata and then do the training.
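
As a rough illustration of the text2image route, the Python sketch below only
assembles a command line that renders a text file in one font to a .tif/.box
pair. The --text, --outputbase, --font and --fonts_dir options are the ones I
know text2image accepts; the helper names, the font name and the paths are
placeholders and not taken from this thread. The output base follows the
lang.fontname.expN naming used for box files elsewhere on this list.
tesstrain.sh drives this step itself, so calling text2image by hand is only
useful for experimenting.

```python
def output_base(lang: str, font: str, exp: int = 0) -> str:
    """Follow the lang.fontname.expN convention, dropping spaces from the font name."""
    return f"{lang}.{font.replace(' ', '')}.exp{exp}"


def text2image_command(text_file: str, lang: str, font: str, fonts_dir: str) -> list[str]:
    """Argument list that renders text_file in a single font."""
    return [
        "text2image",
        f"--text={text_file}",
        f"--outputbase={output_base(lang, font)}",
        f"--font={font}",
        f"--fonts_dir={fonts_dir}",
    ]


# Placeholder values: substitute your own training text, font and font directory.
cmd = text2image_command("all_char.txt", "chi_sim", "Noto Sans CJK SC", "/usr/share/fonts")
print(cmd[2])  # prints --outputbase=chi_sim.NotoSansCJKSC.exp0
```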

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Fri, Jun 16, 2017 at 3:06 PM, Richard Foo  wrote:

> Dear all,
>
> I am new to tesseract. When training a language with a large character
> set, like Chinese, I am not sure at which step I should use the character
> set (over 7000 characters) I prepared. Currently, I treat it as a training
> set by converting all_char.txt to TIFF files. That gives me image training
> data in a single font, which can be used for making box files.
>
> P.S. Is there any method (software) for rendering synthetic training data
> from text, other than printing and scanning?
>
> thanks,
>
> Richard
>


Re: [tesseract-ocr] How to regenerate the training text

2017-06-15 Thread ShreeDevi Kumar
You can also see https://ancientgreekocr.org/ for Nick White's method of
creating training data for Ancient Greek.
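
Since the original scripts are not available, here is one possible way to
turn a word list with frequencies into training text, sketched in Python. It
is not Google's internal method and not Nick White's either: it simply samples
words in proportion to their unigram frequency and joins them into lines. The
function name make_training_text and all parameters are hypothetical. For a
language that is not space-delimited, such as chi_sim, the separator can be
set to an empty string.

```python
import random


def make_training_text(frequencies: dict[str, int], lines: int, words_per_line: int,
                       separator: str = " ", seed: int = 0) -> list[str]:
    """Sample words in proportion to their frequency and group them into lines."""
    rng = random.Random(seed)  # a fixed seed keeps the output reproducible
    words = list(frequencies)
    weights = [frequencies[word] for word in words]
    text = []
    for _ in range(lines):
        sample = rng.choices(words, weights=weights, k=words_per_line)
        text.append(separator.join(sample))
    return text


# Toy frequency table; a real one would come from your own unigram data.
toy = {"的": 50, "中国": 10, "文字": 5, "识别": 2}
for line in make_training_text(toy, lines=3, words_per_line=8, separator=""):
    print(line)
```

Bigram data could be used the same way by sampling each word conditioned on
the previous one, at the cost of a slightly longer loop.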

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Fri, Jun 16, 2017 at 8:18 AM, ShreeDevi Kumar <shreesh...@gmail.com>
wrote:

> >Where are these scripts, or how can I otherwise generate training text
> from dictionary/corpus data?
>
> These are (most probably) internal scripts at Google which have not been
> open sourced.
>
> Please see
> https://groups.google.com/forum/#!searchin/tesseract-ocr/training$20text%7Csort:date/tesseract-ocr/-B0mWBwki5w/zuR4R6AGAgAJ
> which has Ray's comments regarding the generation of training text.
>
> ShreeDevi
> 
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Thu, Jun 15, 2017 at 7:05 PM, Dingyuan Wang <abcdoyle...@gmail.com>
> wrote:
>
>> Dear all,
>>
>> I'm trying to generate a training text (chi_sim) for training tesseract
>> because I have a better dictionary and unigram/bigram data than the
>> defaults. I've found the following comments in
>> training/language-specific.sh
>>
>> (line 845)
>> # Set language-specific values for several global variables, including
>> #   ${TEXT_CORPUS}
>> #  holds the text corpus file for the language, used in phase F
>> #   ${FONTS[@]}
>> #  holds a sequence of applicable fonts for the language, used in
>> #  phase F & I. only set if not already set, i.e. from command line
>> #   ${TRAINING_DATA_ARGUMENTS}
>> #  non-default arguments to the training_data program used in phase T
>> #   ${FILTER_ARGUMENTS} -
>> #  character-code-specific filtering to distinguish between scripts
>> #  (eg. CJK) used by filter_borbidden_characters in phase F
>> #   ${WORDLIST2DAWG_ARGUMENTS}
>> #  specify fixed length dawg generation for non-space-delimited lang
>> # TODO(dsl): We can refactor these into functions that assign FONTS,
>> # TEXT_CORPUS, etc. separately.
>>
>> So I suppose there are scripts called training_data (phase T)
>> and filter_borbidden_characters (sic, phase F) that create the training
>> text from some wordlists and unigram/bigram frequency data.
>>
>> Where are these scripts, or how can I otherwise generate training text
>> from dictionary/corpus data?
>>
>> Thanks.
>>