from:"ShreeDevi Kumar"

Re: [tesseract-ocr] Tesseract 4 training related issue

2018-06-15 Thread ShreeDevi Kumar

Are you using images and box files? Does your box file have boxes for
spaces between words?

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


On Fri, Jun 15, 2018 at 12:42 PM pranaya mhatre 
wrote:

> Hi,
>
> I trained Tesseract 4 many times on images by fine-tuning the English model,
> but after training Tesseract won't put a space between two words. How should
> I resolve this spacing problem?
>
> And how should I train Tesseract to detect text boxes appropriately
> for italic fonts?
>
> Thank you
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To post to this group, send email to tesseract-ocr@googlegroups.com.
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/adcfff72-4bb2-4900-9332-300beb8b0c2b%40googlegroups.com
> 
> .
> For more options, visit https://groups.google.com/d/optout.
>


Re: [tesseract-ocr] Can "traineddata" for Tesseract 3 be used for Tesseract 4

2018-06-13 Thread ShreeDevi Kumar

If you have box/tiff pairs in Tesseract 4 format, you can generate the lstmf
files by running

tesseract lang.file.exp0.tif lang.file.exp0 lstm.train

lstm.train is a config file.
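
As a rough sketch of the step above, the following Python helper builds one
command of that shape per image in a directory. The directory layout, the
.tif filter, the rule of skipping images without a .box file, and the function
name lstmf_commands are assumptions made for illustration; only the command
shape itself comes from this message.

```python
from pathlib import Path


def lstmf_commands(image_dir, tesseract="tesseract"):
    """Return one command (as an argv list) per .tif image in image_dir.

    Each command has the shape given above:
        tesseract lang.file.exp0.tif lang.file.exp0 lstm.train
    Assumption: the matching .box file sits next to each image.
    """
    commands = []
    for tif in sorted(Path(image_dir).glob("*.tif")):
        box = tif.with_suffix(".box")
        if not box.exists():
            # Hypothetical policy: skip images that have no box file.
            continue
        output_base = tif.with_suffix("")  # strips only the final ".tif"
        commands.append([tesseract, str(tif), str(output_base), "lstm.train"])
    return commands
```

Each returned list can be passed to subprocess.run(cmd, check=True) once the
commands look right for your data.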

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


On Wed, Jun 13, 2018 at 6:46 PM chandra churh chatterjee <
chandrachurh.chatterje...@gmail.com> wrote:

> I have trained Tesseract 3 with 64 fonts using the respective box and .tr
> files, but now I want to use the same training data for training Tesseract 4
> after creating the starter traineddata using the "Using tesstrain
>
> The setup for running tesstrain.sh is the same as for base Tesseract. Use
> --linedata_only option for LSTM training. Note that it is beneficial to
> have more training text and make more pages though, as neural nets don't
> generalize as well and need to train on something similar to what they will
> be running on. If the target domain is severely limited, then all the dire
> warnings about needing a lot of training data may not apply, but the
> network specification may need to be changed.
>
> Training data is created using tesstrain.sh as follows. Note that your
> fonts location may vary.
>
> training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
>   --noextract_font_properties --langdata_dir ../langdata \
>   --tessdata_dir ./tessdata --output_dir ~/tesstutorial/engtrain
>
> The above command makes LSTM training data equivalent to the data used to
> train base Tesseract for English. For making a general-purpose LSTM-based
> OCR engine, it is woefully inadequate, but makes a good tutorial demo.
>
> Now try this to make eval data for the 'Impact' font:
>
> training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
>   --noextract_font_properties --langdata_dir ../langdata \
>   --tessdata_dir ./tessdata \
>   --fontlist "Impact Condensed" --output_dir ~/tesstutorial/engeval"
>
>
>
> Now I want to proceed further and use my previous training data for the
> training, but the problem is that the previous training data consists of
> .tr files and box files, whereas Tesseract 4 requires .lstmf files.
> I would appreciate any solution.
>


Re: [tesseract-ocr] How to assess the quality of Tesseract OCR output programmatically?

2018-06-13 Thread ShreeDevi Kumar

You can compare the OCRed text with ground-truth text. If you are creating a
PDF, you will have to extract the text from it to compare.

There are two options:

https://github.com/impactcentre/ocrevalUAtion

or

https://github.com/eddieantonio/isri-ocr-evaluation-tools
https://github.com/ryanfb/ancientgreekocr-ocr-evaluation-tools
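
To illustrate what such a comparison measures, here is a minimal Python sketch
of a character error rate based on edit distance. It is not the method used by
the tools linked above, which report more detailed statistics; the function
names and the handling of an empty ground truth are assumptions made for this
example.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def character_error_rate(ocr_text, ground_truth):
    """Edit distance divided by the length of the ground-truth text."""
    if not ground_truth:
        # Assumed convention: empty ground truth is only matched by empty OCR.
        return 0.0 if not ocr_text else 1.0
    return edit_distance(ocr_text, ground_truth) / len(ground_truth)
```

A score like this needs ground-truth text for each page, so it suits testing
on a reference set rather than rating arbitrary new scans.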

On Wed, Jun 13, 2018 at 12:41 PM nitin  wrote:

> Hi dear members,
>
> Is there a way to 'assess the quality of Tesseract OCR output'?
>
> I need to provide such statistics along with the scanned image-to-PDF
> output file results, so that users can decide and sort whether the output
> quality is acceptable or not (e.g. above 50% or 80% of the text recognized
> successfully).
> I also need to determine this programmatically.
>
> Thanks for your time.
> Regards
>


Re: [tesseract-ocr] Tesseract 4 for old languages

2018-06-12 Thread ShreeDevi Kumar

Please also see http://doc-creator.labri.fr/

which makes it easy to create synthetic data similar to manuscript pages.


On Tue, Jun 12, 2018 at 9:03 PM ShreeDevi Kumar 
wrote:

> Please see the project https://github.com/OCR-D/ocrd-train
>
> It has support for training tesseract if you provide line images and
> matching ground truth text.

Re: [tesseract-ocr] Tesseract 4 for old languages

2018-06-12 Thread ShreeDevi Kumar

Please see the project https://github.com/OCR-D/ocrd-train

It has support for training tesseract if you provide line images and
matching ground truth text.
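
Before starting such a training run, it can help to check that every line
image has a transcription. The sketch below assumes each line image NAME.tif
has its text in NAME.gt.txt in the same directory; that naming, the default
suffixes, and the function name pair_ground_truth are assumptions for
illustration, so check the project's README for the layout it actually
expects.

```python
from pathlib import Path


def pair_ground_truth(directory, image_suffix=".tif", text_suffix=".gt.txt"):
    """Pair line images with their transcriptions.

    Returns (pairs, missing): pairs is a list of (image_path, text) tuples,
    missing is a list of image paths that have no transcription file.
    """
    pairs, missing = [], []
    for image in sorted(Path(directory).glob("*" + image_suffix)):
        stem = image.name[: -len(image_suffix)]
        text_file = image.with_name(stem + text_suffix)
        if text_file.exists():
            text = text_file.read_text(encoding="utf-8").strip()
            pairs.append((image, text))
        else:
            missing.append(image)
    return pairs, missing
```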


On Tue, Jun 12, 2018 at 8:19 PM  wrote:

> Same question here. I see that the documentation on training Tesseract 4
> makes some reference to manuscripts:
>
> As with base Tesseract, there is a choice between rendering synthetic
> training data from fonts, or labeling some pre-existing images (like
> ancient manuscripts for example).
>
> So, if I understand correctly, there is no support yet for training with
> labelled pre-existing images? The concept of a font does not make sense
> with manuscripts, and what we can use in this case is just pairs of images
> and text (transcription).
>
> Best,
> Jean-Baptiste Camps
>
> On Monday, 12 March 2018 at 10:59:41 UTC+1, shree wrote:
>>
>> >I have an image and a text file with the line content for each line of
>> manuscript text. The doc says what to do, but not how.
>>
>> >I first thought I'd need img/box files pairs, but it seems it was for
>> Tesseract 3 and is now irrelevant...
>>
>> Tesseract4.0.0beta.1 does not officially support LSTM training from
>> box/tif pairs.
>>
>> It uses box/tif pairs generated using the synthetic training data
>> generation pipeline using a training_text and set of fonts, for making the
>> lstmf files that are used by lstmtraining.
>>
>> langdata refers to the langdata repository under tesseract-ocr github
>> repo. The files in it have not been updated for 4.0.0
>>
>>
>>
>>
>> ShreeDevi
>> 
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Mon, Mar 12, 2018 at 2:00 PM, ShreeDevi Kumar 
>> wrote:
>>
>>> Please try tesseract 4.0.0beta.1  with languages such as
>>>
>>> *enm* (English, Middle (1100-1500))
>>>
>>> and
>>>
>>> Fraktur  script
>>>
>>> Also, look at the following project from a few years back
>>>
>>> http://emop.tamu.edu/outcomes/Franken-Plus
>>>
>>> ShreeDevi
>>> 
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>
>>> On Mon, Mar 12, 2018 at 4:32 AM, Guillaume Desforges 
>>> wrote:
>>>
>>>> Hi
>>>>
>>>> I want to try using Tesseract 4 for old manuscript languages ("The Song
>>>> of Roland" and such).
>>>>
>>>> I have looked at
>>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
>>>> but the steps are very unclear.
>>>>
>>>> I have an image and a text file with the line content for each line of
>>>> manuscript text. The doc says what to do, but not how.
>>>>
>>>> I first thought I'd need img/box files pairs, but it seems it was for
>>>> Tesseract 3 and is now irrelevant...
>>>>
>>>> So I guess my starting point is here :
>>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#tutorial-guide-to-lstmtraining
>>>>
>>>>> There is no tool to create the lstm-recoder directly. Instead there is
>>>>> a new tool, combine_lang_model, which takes as input an
>>>>> input_unicharset and script_dir (script_dir points to the langdata
>>>>> directory) and optional word list files. It creates the lstm-recoder
>>>>> from the input_unicharset and creates all the dawgs, if wordlists are
>>>>> provided, putting everything together into a traineddata file.
>>>>
>>>>
>>>> I don't really get this part. How do I make  input_unicharset ? What
>>>> is langdata?
>>>>
>>>> Thanks
>>>>
>>>> Guillaume Desforges
>>>>

Re: [tesseract-ocr] Re: use multi threads in tesseract

2018-06-12 Thread ShreeDevi Kumar

Thank you for the info.

The following link also has helpful info.

https://www.ibm.com/support/knowledgecenter/SSGH2K_13.1.2/com.ibm.xlc131.aix.doc/compiler_ref/omp_thread_limit.html
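
The suggestion quoted below (limit the OpenMP threads of each process and run
several images in parallel instead) could be sketched in Python as follows.
This is only an illustration under assumptions: it calls the tesseract
command-line program directly rather than going through pytesseract, and the
function names, the worker count, and the limit of one thread per process are
hypothetical choices.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def single_thread_env(limit=1):
    """Return a copy of the current environment with OMP_THREAD_LIMIT set."""
    env = dict(os.environ)
    env["OMP_THREAD_LIMIT"] = str(limit)
    return env


def ocr_one(image_path, tesseract="tesseract"):
    """Run the tesseract CLI on one image and return the recognized text."""
    result = subprocess.run(
        [tesseract, image_path, "stdout"],  # "stdout" prints the text
        env=single_thread_env(1),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def ocr_many(image_paths, workers=4):
    """OCR several images concurrently, one tesseract process per image."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ocr_one, image_paths))
```

The environment variable is set for each child process rather than assigned
as an attribute on the pytesseract module, because it only has an effect when
it is present in the environment of the process that runs Tesseract.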


ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


On Fri, Jun 8, 2018 at 2:39 PM Jakob Salomonsson <
jakob.salomons...@gmail.com> wrote:

> After speaking to one of the tesseract contributors:
>
> import os
>
> os.environ['OMP_THREAD_LIMIT'] = '2'
>
>
> Should do the job, as OMP_NUM_THREADS is an environment variable. However,
> the speed difference is very small. It might be better to process several
> images in parallel rather than to process one as fast as possible.
>
>
>
>
>
> On Tue, 29 May 2018 at 07:56, nick wrote:
>
>> Hi,
>> I don't know which file should be changed for OMP_NUM_THREADS, or which
>> command I should test.
>>
>> On Monday, May 28, 2018 at 3:10:25 PM UTC+4:30, Jakob Salomonsson wrote:
>>>
>>> Calling the help function in python through
>>> help(pytesseract.pytesseract) yields this result, among others:
>>>
>>> *DATA*
>>> *OMP_NUM_THREADS = 3*
>>> *OMP_THREAD_LIMIT = 3*
>>> *RGB_MODE = 'RGB'*
>>> *__warningregistry__ = {'version': 332, ('unclosed file
>>> <_io.BufferedWr...*
>>> *numpy_installed = True*
>>> *tesseract_cmd = '/anaconda3/envs/Work/bin/tesseract'*
>>>
>>>
>>> I'm specifying tesseract_cmd (through
>>> pytesseract.pytesseract.tesseract_cmd = "/anaconda3/envs/Work/bin/tesseract")
>>> and it works as intended. But when I try to do the same with
>>> OMP_NUM_THREADS or OMP_THREAD_LIMIT (through
>>> pytesseract.pytesseract.OMP_NUM_THREADS = 3 or
>>> pytesseract.pytesseract.OMP_THREAD_LIMIT = 3), no multithreading happens.
>>>
>>>
>>>
>>> On Mon, 28 May 2018 at 12:11, nick wrote:
>>>
>>>> How and where can we change this variable?


Re: [tesseract-ocr] Image DPI restriction

2018-06-11 Thread ShreeDevi Kumar

For better recognition 300 dpi is recommended.

You can use a program like ImageMagick to change the DPI if needed.
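
As a small arithmetic sketch of what these numbers mean in pixels: the first
function estimates the effective DPI of a capture from the physical width of
the page, and the second gives the pixel size an image would need after
resampling to a target DPI. Both function names are hypothetical. The second
assumes the stored DPI value reflects the real physical size of the page,
which may not hold for phone photos, where the value can be just a default
tag.

```python
def effective_dpi(width_px, page_width_inches):
    """Pixels per inch actually captured across a page of known width."""
    return width_px / page_width_inches


def resample_size(width_px, height_px, current_dpi, target_dpi=300):
    """Pixel dimensions after resampling from current_dpi to target_dpi."""
    new_width = round(width_px * target_dpi / current_dpi)
    new_height = round(height_px * target_dpi / current_dpi)
    return new_width, new_height
```

For example, a letter-size page (8.5 inches wide) captured 2550 pixels wide
has an effective resolution of 300 dpi, whatever the metadata says.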


On Mon, Jun 11, 2018 at 8:30 PM Vidur Malhotra 
wrote:

> Hi,
> I was going through Tesseract tutorials in which it is mentioned that, for
> Tesseract to do OCR, an image should have at least 300 dpi. How is that
> possible? I tried capturing images with different phones (even an iPhone 7
> Plus), and all of them give 72 dpi.
>


Re: [tesseract-ocr] [SOLVED] Re: tess4j: NullPointerException while reading text in rectangle of image.

2018-06-09 Thread ShreeDevi Kumar

For tess4j see

https://github.com/nguyenq/tess4j/blob/master/src/test/java/net/sourceforge/tess4j/TessAPI1Test.java



On Sun 10 Jun, 2018, 12:51 AM Dattatraya Tembare, 
wrote:

> I have used another method, and it worked perfectly.
>
> public static void main(String[] args) {
>     String fileStr = "C:/EA/mp-out/im/1/3/1-0.png";
>     File file = new File(fileStr);
>     // 380x45+220+170
>     int xsize = 0;
>     int ysize = 0;
>     BufferedImage bufImage = null;
>     ByteBuffer buf = null;
>     try {
>         bufImage = ImageIO.read(file);
>         IIOImage image = new IIOImage(bufImage, null, null);
>         buf = ImageIOHelper.getImageByteBuffer(image);
>     } catch (IOException e2) {
>         e2.printStackTrace();
>     }
>     // define an equal or smaller region of interest on the image
>     Rectangle rect = new Rectangle(220, 170, 380, 45);
>     int bpp = 8; // Gray=8, RGB=24
>
>     Tesseract in = new ReadImageText().getTesseractInstance(
>             "C:/Program Files (x86)/Tesseract-OCR/tessdata/", "hin");
>     try {
>         String resultText = in.doOCR(bufImage, rect);
>         // in.doOCR(xsize, ysize, buf, rect, bpp);
>         log.info("resultText: {}", resultText);
>     } catch (TesseractException e) {
>         e.printStackTrace();
>     }
> }
>
>
> On Saturday, June 9, 2018 at 3:07:02 PM UTC-4, Dattatraya Tembare wrote:
>>
>> I'm trying to read the text at a particular location in an image. I have
>> the image dimensions and the desired rectangle dimensions.
>> Here is the code implementation:
>>
>> public static void main(String[] args) {
>>     String fileStr = "C:/EA/mp-out/im/1/3/1-0.png";
>>     File file = new File(fileStr);
>>     // 380x45+220+170
>>     int xsize = 0;
>>     int ysize = 0;
>>     BufferedImage bufImage;
>>     ByteBuffer buf = null;
>>     try {
>>         bufImage = ImageIO.read(file);
>>         IIOImage image = new IIOImage(bufImage, null, null);
>>         buf = ImageIOHelper.getImageByteBuffer(image);
>>     } catch (IOException e2) {
>>         e2.printStackTrace();
>>     }
>>     // define an equal or smaller region of interest on the image
>>     Rectangle rect = new Rectangle(0, 0, 600, 265);
>>     int bpp = 8; // Gray=8, RGB=24
>>
>>     Tesseract in = new ReadImageText().getTesseractInstance(
>>             "C:/Program Files (x86)/Tesseract-OCR/tessdata/", "hin");
>>     try {
>>         String resultText = in.doOCR(xsize, ysize, buf, rect, bpp);
>>         log.info("resultText: {}", resultText);
>>     } catch (TesseractException e) {
>>         e.printStackTrace();
>>     }
>> }
>>
>> When I executed the code, I got the error below:
>>
>> java.lang.NullPointerException: null
>>  at net.sourceforge.tess4j.Tesseract.getOCRText(Tesseract.java:434)
>>  at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:351)
>>  at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:323)
>>  at com.ea.ocr.tesseract.ReadImageText.main(ReadImageText.java:74)
>>
>> Please look into it and let me know if anyone has any idea.
>>


Re: [tesseract-ocr] error

2018-06-09 Thread ShreeDevi Kumar

You are probably using the wrong traineddata file, i.e. a 3.0x version file
with the latest 4.0x code from the master branch.


ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


On Sat, Jun 9, 2018 at 3:33 PM Vishal Jha  wrote:

> 1, 'read_params_file: parameter not found: enable_new_segsearch')
>


Re: [tesseract-ocr] Unrecognized argument --linedata_only

2018-06-09 Thread ShreeDevi Kumar

Try without

  --eval_listfile /home/kddlab/Desktop/tesseract-master/1MyData/testfas1/fas.training_files.txt \

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


On Sat, Jun 9, 2018 at 1:58 PM Khosrobeigy.zohreh 
wrote:

> Thanks, your command fixed it.
> But next I used this:
>
> lstmtraining \
>   --traineddata /home/kddlab/Desktop/tesseract-master/1MyData/testfas/fas/fas.traineddata \
>   --net_spec '[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx192O1c1]' \
>   --model_output /home/kddlab/Desktop/tesseract-master/1MyData/testfasout/base \
>   --learning_rate 20e-4 \
>   --train_listfile /home/kddlab/Desktop/tesseract-master/1MyData/testfas/fas.training_files.txt \
>   --eval_listfile /home/kddlab/Desktop/tesseract-master/1MyData/testfas1/fas.training_files.txt \
>   --max_iterations 5000 &>/home/kddlab/Desktop/tesseract-master/1MyData/testfasout/basetrain.log
>
> and I now get this error:
>
> Segmentation fault (core dumped)
>
> Could you please help me again?

Re: [tesseract-ocr] Unrecognized argument --linedata_only

2018-06-09 Thread ShreeDevi Kumar

--linedata_only should work.

> tesseract 4.0.0-beta.1

Do you know which commit? Please try with latest code.

>   i am using   src/training/tesstrain.sh

The command you used was:

>  sudo tesstrain.sh

Why do you need sudo?

Please run the script with

bash -x   src/training/tesstrain.sh etc ... and report with the console log.

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


On Sat, Jun 9, 2018 at 11:57 AM Zohreh Khosrobeygi 
wrote:

> Yes, i am using   src/training/tesstrain.sh
>
>
> On Friday, June 8, 2018 at 6:44:27 PM UTC+4:30, shree wrote:
>>
>> Are you using the correct version of tesstrain.sh?
>>
>> It should be in src/training/tesstrain.sh
>>
>>
>> ShreeDevi
>> 
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>>
>> On Fri, Jun 8, 2018 at 6:49 PM Zohreh Khosrobeygi 
>> wrote:
>>
>>> Hi,
>>> I have been training Tesseract but I get this error:
>>>
>>> Unrecognized argument --linedata_only
>>>
>>> This is my version of Tesseract:
>>> tesseract 4.0.0-beta.1
>>>  leptonica-1.74.4
>>>   libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 :
>>> zlib 1.2.8
>>>
>>>  Found AVX2
>>>  Found AVX
>>>  Found SSE
>>>
>>> This is my command:
>>> sudo tesstrain.sh --fonts_dir /usr/share/fonts --lang fas \
>>>   --training_text /home/kddlab/Desktop/tesseract-master/1MyData/fas/fas.training_text \
>>>   --linedata_only \
>>>   --noextract_font_properties \
>>>   --langdata_dir /home/kddlab/Desktop/tesseract-master/langdata \
>>>   --tessdata_dir /home/kddlab/Desktop/tesseract-master/tessdata \
>>>   --fontlist "B Mitra" \
>>>   --output_dir /home/kddlab/Desktop/tesseract-master/1MyData/testfas
>>>
>>> And I have this config file:
>>> # Use LSTM
>>> tessedit_ocr_engine_mode 1
>>> tessedit_pageseg_mode 6
>>>
>>> How can I solve this?
>>>

Re: [tesseract-ocr] Unrecognized argument --linedata_only

2018-06-08 Thread ShreeDevi Kumar

Are you using the correct version of tesstrain.sh?

It should be in src/training/tesstrain.sh


ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


On Fri, Jun 8, 2018 at 6:49 PM Zohreh Khosrobeygi 
wrote:

> Hi,
> I have been training Tesseract but I get this error:
>
> Unrecognized argument --linedata_only
>
> And this is my version of Tesseract:
> tesseract 4.0.0-beta.1
>  leptonica-1.74.4
>   libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib
> 1.2.8
>
>  Found AVX2
>  Found AVX
>  Found SSE
>
> Besides it's my command:
> sudo tesstrain.sh --fonts_dir /usr/share/fonts --lang fas
> --training_text
> /home/kddlab/Desktop/tesseract-master/1MyData/fas/fas.training_text
>  --linedata_only \
>   --noextract_font_properties --langdata_dir
> /home/kddlab/Desktop/tesseract-master/langdata \
>   --tessdata_dir /home/kddlab/Desktop/tesseract-master/tessdata \
>   --fontlist "B Mitra" --output_dir
> /home/kddlab/Desktop/tesseract-master/1MyData/testfas
>
> And i have config file:
> # Use LSTM
> tessedit_ocr_engine_mode 1
> tessedit_pageseg_mode 6
>
> How can i solve this?
>

Re: [tesseract-ocr] Suggestion for the API

2018-06-07 Thread ShreeDevi Kumar

You can submit these changes as a pull request in the GitHub repo for easier
review and search.


On Wed, Jun 6, 2018 at 2:24 PM Paul TOTH  wrote:

> Hello,
>
> I'm not a C++ developer and I'm new to the project so I don't want to
> disturb the repository with my code...but, I've made some changes that
> could be interesting.
>
> my purpose was to use libTesseract (3) from a Delphi application.
>
> first change, I've added a function to deal with in-memory image.
>
> TESS_API BOOL  TESS_CALL TessBaseAPIProcessPagesData(TessBaseAPI* handle,
> const unsigned char* imagedata, int imagesize, ETEXT_DESC* monitor,
> TessResultRenderer* renderer);
>
> the code is almost the same as for stdin but with provided imagedata and
> with a monitor to handle progression.
>
> Then I'd like to add my own output handler, so I've added this function:
>
> typedef int(*WRITE_FUNC)(void* sender, const char* data, int size);
>
> TESS_API void TESS_CALL TessResultRendererWriteCallback(
>     TessResultRenderer* renderer, WRITE_FUNC writefunc, void* sender);
>
>
> with just a fix in TessResultRenderer:
>
> void TessResultRenderer::AppendData(const char* s, int len) {
>   // Use the caller-supplied callback when one is set, else write to fout_.
>   int n = writefunc_ ? writefunc_(writesender_, s, len)
>                      : fwrite(s, 1, len, fout_);
>   if (n != len) happy_ = false;
> }
>
> Now I am able to convert any image from a memory stream to any kind of
> stream... this is useful for database image processing.
>
> It could be better with source image streaming, but that requires more
> changes.
>

Re: [tesseract-ocr] Re: Preprocess Image

2018-06-04 Thread ShreeDevi Kumar

Take a look at http://www.fmwconcepts.com/imagemagick/textcleaner/
and other scripts by Fred

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Mon, Jun 4, 2018 at 10:52 PM, Hongguo An  wrote:

> Can anybody help? thanks in advance
>
> On Thursday, May 31, 2018 at 12:57:20 PM UTC-7, Hongguo An wrote:
>>
>>
>> 
>> Hi:
>> When trying to OCR the above image, the date 09/02/2017 is always wrong,
>> (0G/02/2017).
>>
>>
>> This is tesseract 4 running on linux, the cmd line is:
>>
>> tesseract stdin stdout -l eng --psm 11 --oem 1 -c textonly_pdf=1 -c
>> tessedit_create_pdf=1 | pdftotext -layout - -
>>
>>
>> Is there any way to pre-process the image to make it work? (preferably
>> using convert)
>>
>>
>> Thanks
>>
>> Hongguo An
>>

Re: [tesseract-ocr] How to train by tesseract 4.00

2018-06-03 Thread ShreeDevi Kumar

If you want to train using fonts, use tesstrain.sh. See the wiki pages
regarding training.

If you want to use scanned images, then see
https://github.com/OCR-D/ocrd-train for using line images and their ground
truth transcriptions to create box files, lstmf files and training.

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sun, Jun 3, 2018 at 3:59 PM,  wrote:

> I have read that in version 4.00, a box in the box file only needs to
> cover a text line instead of individual characters.
>
> So I made a box file like this:
>
> 若存在，试求出实数λ的值； 0 0 256 48 0
>
> Then I want to ask how to train it.
>
> Or is it the same as in version 3?   [tesseract chi_my.font.exp0.tif
> chi_my.font.exp0 nobatch box.train]
>
> Or is there another, better method?
>
> Thanks!
>

Re: [tesseract-ocr] error in lstm training

2018-06-02 Thread ShreeDevi Kumar

> !int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 244

You can only use --continue_from with models from the tessdata_best repo,
which are float models. The integer models in tessdata and tessdata_fast
cannot be used for that purpose.
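
As a rough sketch of what that means for the command in the question: extract the LSTM component from a tessdata_best traineddata file with combine_tessdata -e, then point --continue_from at the extracted file. The helper name `finetune_commands` and all paths are hypothetical; the lstmtraining flags are the ones used in the original message.

```python
from pathlib import Path


def finetune_commands(best_traineddata, work_dir, train_listfile,
                      max_iterations=400):
    """Return (extract, train) command lines for fine-tuning a float model.

    best_traineddata is assumed to come from the tessdata_best repo; the
    integer models in tessdata and tessdata_fast hit the assert shown above.
    """
    best = Path(best_traineddata)
    work = Path(work_dir)
    lstm = work / (best.stem + ".lstm")
    # Step 1: extract the float LSTM component from the traineddata file.
    extract = ["combine_tessdata", "-e", str(best), str(lstm)]
    # Step 2: continue training from the extracted float model.
    train = [
        "lstmtraining",
        "--continue_from", str(lstm),
        "--traineddata", str(best),
        "--max_iterations", str(max_iterations),
        "--debug_interval", "0",
        "--train_listfile", str(train_listfile),
        "--model_output", str(work / "finetune"),
    ]
    return extract, train
```

The two lists can be passed to subprocess.run in turn, or printed and run by hand.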

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sat, Jun 2, 2018 at 4:15 PM, nick  wrote:

> hi
>
> I tried to fine-tune eng.traineddata. lstmtraining raised this error:
>
>
> lstmtraining
>
> --continue_from ./tesseract-4.0.0-beta.1.20180414/tessdata/eng.lstm
>
> --traineddata ./tesseract-4.0.0-beta.1.20180414/tessdata/eng.traineddata
>
> --max_iterations 400
>
> --debug_interval 0
>
> --train_listfile ./finetune_train_eng/eng.training_files.txt
>
> --model_output ./finetune_trained_eng-from-eng/finetune
>
>
> ERROR:
>
> Loaded file ./tesseract-4.0.0-beta.1.20180414/tessdata/eng.lstm,
> unpacking...
>
> Warning: LSTMTrainer deserialized an LSTMRecognizer!
>
> Continuing from ./tesseract-4.0.0-beta.1.20180414/tessdata/eng.lstm
>
> Loaded 72/72 pages (1-72) of document ./finetune_train_eng/eng.Arial.exp0.lstmf
>
> !int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 244
>
> Segmentation fault (core dumped)
>
>
>
>
> how could i solve it ?
> thanks
>

Re: [tesseract-ocr] lstmeval gives a perfect result but tesseract fails

2018-06-01 Thread ShreeDevi Kumar

From what I understand from the documentation provided by Ray Smith
regarding LSTM training, the models have been trained on hundreds of
thousands of lines and hundreds of fonts. The network spec used for
training from scratch will therefore be optimized for such large models.

You seem to have a different requirement, hence I suggested building the
legacy tesseract model.

You can experiment and see if it is better.

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Fri, Jun 1, 2018 at 12:23 PM, Julien Jemine 
wrote:

> Hi Shree,
>
> Thanks for your answer.
> If you don't mind, could you explain why it'd be better ?
>
> On Thursday, 31 May 2018 at 17:25:47 UTC+2, shree wrote:
>>
>> > I've trained a LSTM model for a custom language from scratch as explained
>> > here.
>>
>> > The language only has about 100 words and 17 characters, so it's pretty
>> > simple.
>>
>> For such a small model, try to build the legacy version rather than LSTM.
>>
>> $tesstrain_dir/tesstrain.sh \
>>    --lang $Lang \
>>    --exposures "0" \
>>    --fonts_dir $fonts_dir \
>>    --fontlist $fonts_for_training \
>>    --langdata_dir $langdata_dir \
>>    --tessdata_dir  $tessdata_dir \
>>    --training_text $langdata_dir/$Lang/$Lang.training_text \
>>    --output_dir $train_output_dir
>>
>>
>>
>> ShreeDevi
>> 
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Thu, May 31, 2018 at 3:43 PM, Julien Jemine 
>> wrote:
>>
>>> Hi,
>>>
>>> I've trained a LSTM model for a custom language from scratch as
>>> explained here.
>>>
>>> The language only has about 100 words and 17 characters, so it's pretty
>>> simple.
>>>
>>> When I run lstmeval on my model, I get a perfect match:
>>> [icm@u16-offcao-07] train1$ lstmeval --model
>>> /home/icm/share/tessdata/iqi.traineddata --eval_listfile
>>> iqitrain2/iqi.training_files.txt --verbosity 2
>>> Loaded 2/2 pages (1-2) of document /home/icm/train1/iqitrain2/iqi.Arial.exp0.lstmf
>>> Loaded 2/2 pages (1-2) of document /home/icm/train1/iqitrain2/iqi.Calibri.exp0.lstmf
>>> Warning: LSTMTrainer deserialized an LSTMRecognizer!
>>> Truth:ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16
>>> OCR  :ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16
>>> Truth:6CUEN 6 CU EN
>>> OCR  :6CUEN 6 CU EN
>>> Loaded 2/2 pages (1-2) of document /home/icm/train1/iqitrain2/iqi.Lucida_Sans_Typewriter_Semi-Condensed.exp0.lstmf
>>> Truth:ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16
>>> OCR  :ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16
>>> Truth:6CUEN 6 CU EN
>>> OCR  :6CUEN 6 CU EN
>>> Loaded 2/2 pages (1-2) of document /home/icm/train1/iqitrain2/iqi.Verdana.exp0.lstmf
>>> Truth:ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16
>>> OCR  :ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16
>>> Truth:6CUEN 6 CU EN
>>> OCR  :6CUEN 6 CU EN
>>> Truth:6CUEN 6 CU EN
>>> OCR  :6CUEN 6 CU EN
>>> Truth:ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16
>>> OCR  :ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16
>>> At iteration 0, stage 0, Eval Char error rate=0, Word error rate=0
>>>
>>> However, when I put my iqi.traineddata file in my tessdata folder and
>>> try to run tesseract on the same tif file, I get errors:
>>> [icm@u16-offcao-07] train1$ tesseract iqitrain2/iqi.training_img.txt
>>> stdout -l iqi
>>> Page 0 : /home/icm/train1/iqitrain2/iqi.Arial.exp0.tif
>>> 6CFN
>>> 6CUEN 1 CU EN
>>> Page 1 : /home/icm/train1/iqitrain2/iqi.Calibri.exp0.tif
>>>
>>> 6CM 10FEEN 0 6 FEE 13CUEN 11 6 FE EEN 1116
>>> 6UEN 16 FE
>>> Page 2 : /home/icm/train1/iqitrain2/iqi.Lucida_Sans_Typewriter_Semi-Condensed.exp0.tif
>>>
>>> 6TM 13CUEN 13 1 EN 11CUE 11 CU EN 12B 11 16
>>> 6 6 CU EN
>>> Page 3 : /home/icm/train1/iqitrain2/iqi.Verdana.exp0.tif
>>>
>>> ASTM 103UEEN 13 1CU EN 13CUEN 13 6 FE EEN 11 16
>>> 6CUEN 6 CU EN
>>>
>>>
>>> Now the really frustrating part: I have the opposite phenomenon with the
>>> "eng" language! (with eng.traineddata taken from tessdata_best)
>>> lstmeval gives me a few errors (Eval Char error rate=2.4665552, Word
>>> error rate=16.67)
>>> tesseract gives me the right answer! (But the images are generated with
>>> tesstrain.sh and very common fonts, it's probably to be expected).
>>>
>>> Am I doing something wrong?
>>> What's going on here?
>>>

Re: [tesseract-ocr] Not able install tesseract ocr on ubuntu 17.04

2018-06-01 Thread ShreeDevi Kumar

Please see the email from Alex and follow instructions in that.

On Fri 1 Jun, 2018, 10:08 AM RT-Rakesh,  wrote:

>
> Hi ShreeDevi,
>
> Thanks for your response.
>
> I am still getting this error when trying with the command that you shared.
> Please assist me how to go about here.
>
> Thank you very much.
>
> user@computer:~$ sudo apt install tesseract-ocr
> Reading package lists... Done
> Building dependency tree
> Reading state information... Done
> The following packages were automatically installed and are no longer
> required:
>   libgnutls-openssl27 postfix-sqlite
> Use 'sudo apt autoremove' to remove them.
> The following additional packages will be installed:
>   libgif7 liblept5 libtesseract4 tesseract-ocr-eng tesseract-ocr-osd
> The following NEW packages will be installed:
>   libgif7 liblept5 libtesseract4 tesseract-ocr tesseract-ocr-eng
> tesseract-ocr-osd
> 0 upgraded, 6 newly installed, 0 to remove and 180 not upgraded.
> Need to get 6,938 kB of archives.
> After this operation, 21.6 MB of additional disk space will be used.
> Do you want to continue? [Y/n] y
> Err:1 http://us.archive.ubuntu.com/ubuntu zesty/main amd64 libgif7 amd64
> 5.1.4-0.4
>   404  Not Found [IP: 91.189.91.23 80]
> Get:2 http://ppa.launchpad.net/alex-p/tesseract-ocr/ubuntu zesty/main
> amd64 liblept5 amd64 1.74.4-1+nmu1ppa1~zesty1 [929 kB]
> Get:3 http://ppa.launchpad.net/alex-p/tesseract-ocr/ubuntu zesty/main
> amd64 libtesseract4 amd64 4.00~git2192-10a8a67c-1ppa1~zesty1 [1,180 kB]
> Get:4 http://ppa.launchpad.net/alex-p/tesseract-ocr/ubuntu zesty/main
> amd64 tesseract-ocr-eng all 4.00~git15-45ed289-1ppa1~zesty1 [1,590 kB]
>
> Get:5 http://ppa.launchpad.net/alex-p/tesseract-ocr/ubuntu zesty/main
> amd64 tesseract-ocr-osd all 4.00~git15-45ed289-1ppa1~zesty1 [2,989 kB]
>
> Get:6 http://ppa.launchpad.net/alex-p/tesseract-ocr/ubuntu zesty/main
> amd64 tesseract-ocr amd64 4.00~git2192-10a8a67c-1ppa1~zesty1 [219 kB]
>
> Fetched 6,907 kB in 25s (271 kB/s)
>
>
> E: Failed to fetch
> http://us.archive.ubuntu.com/ubuntu/pool/main/g/giflib/libgif7_5.1.4-0.4_amd64.deb
> 404  Not Found [IP: 91.189.91.23 80]
> E: Unable to fetch some archives, maybe run apt-get update or try with
> --fix-missing?
>
>
> On Thursday, 31 May 2018 15:24:48 UTC+5:30, shree wrote:
>>
>> Remove the existing version, then
>>
>>
>> sudo add-apt-repository ppa:alex-p/tesseract-ocr
>> sudo apt-get update
>>
>>
>> sudo apt install tesseract-ocr
>>
>> ShreeDevi
>> 
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Thu, May 31, 2018 at 12:29 PM, RT-Rakesh  wrote:
>>
>>> user@computer:~$ sudo apt install tesseract-ocr
>>> Reading package lists... Done
>>> Building dependency tree
>>> Reading state information... Done
>>> The following packages were automatically installed and are no longer
>>> required:
>>>   libgnutls-openssl27 postfix-sqlite
>>> Use 'sudo apt autoremove' to remove them.
>>> The following additional packages will be installed:
>>>   libgif7 liblept5 libtesseract-data libtesseract3 tesseract-ocr-eng
>>>   tesseract-ocr-equ tesseract-ocr-osd
>>> The following NEW packages will be installed:
>>>   libgif7 liblept5 libtesseract-data libtesseract3 tesseract-ocr
>>>   tesseract-ocr-eng tesseract-ocr-equ tesseract-ocr-osd
>>> 0 upgraded, 8 newly installed, 0 to remove and 180 not upgraded.
>>> Need to get 945 kB/14.6 MB of archives.
>>> After this operation, 57.5 MB of additional disk space will be used.
>>> Do you want to continue? [Y/n] y
>>> Err:1 http://us.archive.ubuntu.com/ubuntu zesty/main amd64 libgif7
>>> amd64 5.1.4-0.4
>>>   404  Not Found [IP: 91.189.91.23 80]
>>> Err:2 http://us.archive.ubuntu.com/ubuntu zesty/universe amd64 liblept5
>>> amd64 1.74.1-1
>>>   404  Not Found [IP: 91.189.91.23 80]
>>> E: Failed to fetch
>>> http://us.archive.ubuntu.com/ubuntu/pool/main/g/giflib/libgif7_5.1.4-0.4_amd64.deb
>>> 404  Not Found [IP: 91.189.91.23 80]
>>> E: Failed to fetch
>>> http://us.archive.ubuntu.com/ubuntu/pool/universe/l/leptonlib/liblept5_1.74.1-1_amd64.deb
>>> 404  Not Found [IP: 91.189.91.23 80]
>>> E: Unable to fetch some archives, maybe run apt-get update or try with
>>> --fix-missing?
>>>
>>>
>>> This is the error being thrown. Can someone help me solve this issue?
>>>

Re: [tesseract-ocr] lstmeval gives a perfect result but tesseract fails

2018-05-31 Thread ShreeDevi Kumar

> I've trained a LSTM model for a custom language from scratch as explained
> here.

> The language only has about 100 words and 17 characters, so it's pretty
> simple.

For such a small model, try to build the legacy version rather than LSTM.

$tesstrain_dir/tesstrain.sh \
   --lang $Lang \
   --exposures "0" \
   --fonts_dir $fonts_dir \
   --fontlist $fonts_for_training \
   --langdata_dir $langdata_dir \
   --tessdata_dir  $tessdata_dir \
   --training_text $langdata_dir/$Lang/$Lang.training_text \
   --output_dir $train_output_dir
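
The same call can be assembled programmatically. The sketch below is illustrative only: `tesstrain_command` is a hypothetical helper, and it assumes that tesstrain.sh builds the legacy model when --linedata_only is omitted and stops after writing LSTM line data when it is present.

```python
def tesstrain_command(tesstrain_dir, lang, fonts_dir, fonts, langdata_dir,
                      tessdata_dir, output_dir, legacy=True):
    """Build the tesstrain.sh argument list used in the command above."""
    cmd = [
        f"{tesstrain_dir}/tesstrain.sh",
        "--lang", lang,
        "--exposures", "0",
        "--fonts_dir", fonts_dir,
        "--fontlist", *fonts,  # one argument per font name
        "--langdata_dir", langdata_dir,
        "--tessdata_dir", tessdata_dir,
        "--training_text", f"{langdata_dir}/{lang}/{lang}.training_text",
        "--output_dir", output_dir,
    ]
    if not legacy:
        # Assumption: this flag makes the script stop after generating
        # the line data (.lstmf files) needed for LSTM training.
        cmd.append("--linedata_only")
    return cmd
```

Passing the list to subprocess.run avoids shell quoting problems with font names that contain spaces.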



ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Thu, May 31, 2018 at 3:43 PM, Julien Jemine 
wrote:

> Hi,
>
> I've trained a LSTM model for a custom language from scratch as explained
> here.
>
> The language only has about 100 words and 17 characters, so it's pretty
> simple.
>
> When I run lstmeval on my model, I get a perfect match:
> [icm@u16-offcao-07] train1$ lstmeval --model 
> /home/icm/share/tessdata/iqi.traineddata
> --eval_listfile iqitrain2/iqi.training_files.txt --verbosity 2
> Loaded 2/2 pages (1-2) of document /home/icm/train1/iqitrain2/iqi.Arial.exp0.lstmf
> Loaded 2/2 pages (1-2) of document /home/icm/train1/iqitrain2/iqi.Calibri.exp0.lstmf
> Warning: LSTMTrainer deserialized an LSTMRecognizer!
> Truth:ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16
> OCR  :ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16
> Truth:6CUEN 6 CU EN
> OCR  :6CUEN 6 CU EN
> Loaded 2/2 pages (1-2) of document /home/icm/train1/iqitrain2/iqi.Lucida_Sans_Typewriter_Semi-Condensed.exp0.lstmf
> Truth:ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16
> OCR  :ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16
> Truth:6CUEN 6 CU EN
> OCR  :6CUEN 6 CU EN
> Loaded 2/2 pages (1-2) of document /home/icm/train1/iqitrain2/iqi.Verdana.exp0.lstmf
> Truth:ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16
> OCR  :ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16
> Truth:6CUEN 6 CU EN
> OCR  :6CUEN 6 CU EN
> Truth:6CUEN 6 CU EN
> OCR  :6CUEN 6 CU EN
> Truth:ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16
> OCR  :ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16
> At iteration 0, stage 0, Eval Char error rate=0, Word error rate=0
>
> However, when I put my iqi.traineddata file in my tessdata folder and try
> to run tesseract on the same tif file, I get errors:
> [icm@u16-offcao-07] train1$ tesseract iqitrain2/iqi.training_img.txt
> stdout -l iqi
> Page 0 : /home/icm/train1/iqitrain2/iqi.Arial.exp0.tif
> 6CFN
> 6CUEN 1 CU EN
> Page 1 : /home/icm/train1/iqitrain2/iqi.Calibri.exp0.tif
>
> 6CM 10FEEN 0 6 FEE 13CUEN 11 6 FE EEN 1116
> 6UEN 16 FE
> Page 2 : /home/icm/train1/iqitrain2/iqi.Lucida_Sans_Typewriter_Semi-Condensed.exp0.tif
>
> 6TM 13CUEN 13 1 EN 11CUE 11 CU EN 12B 11 16
> 6 6 CU EN
> Page 3 : /home/icm/train1/iqitrain2/iqi.Verdana.exp0.tif
>
> ASTM 103UEEN 13 1CU EN 13CUEN 13 6 FE EEN 11 16
> 6CUEN 6 CU EN
>
>
> Now the really frustrating part: I have the opposite phenomenon with the
> "eng" language! (with eng.traineddata taken from tessdata_best)
> lstmeval gives me a few errors (Eval Char error rate=2.4665552, Word error
> rate=16.67)
> tesseract gives me the right answer! (But the images are generated with
> tesstrain.sh and very common fonts, it's probably to be expected).
>
> Am I doing something wrong?
> What's going on here?
>

Re: [tesseract-ocr] Help for training tesseract to recognize a new (dead) language

2018-05-30 Thread ShreeDevi Kumar

See https://github.com/OCR-D/ocrd-train/issues/7

You can use the utilities listed there for creating line-level images from
page images. Make matching ground-truth text files, and train.
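
Before training it can help to check that every line image has a transcription. The sketch below assumes the ocrd-train naming convention, where a line image `name.tif` is paired with `name.gt.txt` in the same folder; `unmatched_ground_truth` is a hypothetical helper name.

```python
from pathlib import Path


def unmatched_ground_truth(gt_dir):
    """Return (images without a transcription, transcriptions without an image).

    Assumes line images end in .tif and their ground truth in .gt.txt.
    """
    root = Path(gt_dir)
    images = {p.name[:-len(".tif")] for p in root.glob("*.tif")}
    texts = {p.name[:-len(".gt.txt")] for p in root.glob("*.gt.txt")}
    return sorted(images - texts), sorted(texts - images)
```

Both returned lists should be empty before you start training.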

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Wed, May 30, 2018 at 4:27 PM, Ramast Magdy  wrote:

> 1. collect utf-8 text in Coptic (DONE)
> 2. Find Coptic unicode fonts, if you can find one similar to the
> typewriter font used in books it will make training easier
> I tried but couldn't find such a font. There are not that many Coptic fonts
> to begin with.
> Can't I just extract a few samples of each letter from the old books?
>
> 3. train a model with these and then finetune it with line images and
> matching ground truth
> I think I got this one.
> After extracting sample letters. arrange them randomly into separate lines
> (image for each line) and provide the text in a file with similar name.
>
> That's a good idea but since I am trying to train for reading old books,
> how can I account for things like slight page tilt during scanning for
> example?
> Also, while at it, is there a tool I could use to split book pages into
> separate lines so that I can use them as part of training (along with their
> text, of course)?
>
>
>
> On 05/30/2018 12:44 PM, ShreeDevi Kumar wrote:
>
> I am trying a test training for coptic for tess4, will let you know where
> to access traineddata.
>
> You can train using UTF-8 text and Unicode Coptic fonts.
>
> 1. collect utf-8 text in Coptic
> 2. Find Coptic unicode fonts, if you can find one similar to the
> typewriter font used in books it will make training easier
> 3. train a model with these and then finetune it with line images and
> matching ground truth
>
>
> ShreeDevi
> 
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Wed, May 30, 2018 at 4:09 PM, Ramast Magdy 
> wrote:
>
>> Thank you ShreeDevi for both moheb's link and the one below.
>> The current one uses Tesseract 3 and according to the author:
>> "Recognition quality of Coptic texts containing old fonts will be very
>> poor, depending on the trained data."
>>
>> I will get in contact with him to see if we can use the other link you
>> provided
>> https://github.com/OCR-D/ocrd-train
>> To train Tesseract 4.00
>>
>> Thank you very much
>>
>>
>> On 05/30/2018 06:31 AM, ShreeDevi Kumar wrote:
>>
>> See http://www.moheb.de/ocr.html
>>
>> It provides a traineddata file for Coptic for use with tesseract version
>> 3.
>>
>> ShreeDevi
>> 
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Tue, May 29, 2018 at 9:57 PM,  wrote:
>>
>>> Hi,
>>> I belong to a group who study an old Egyptian writing system called
>>> "Coptic".
>>> It's based mostly on Greek (with some variation).
>>>
>>> The great majority of Coptic books were written during the last century,
>>> mostly in the same [typewriter] font.
>>> Here is a sample picture:
>>> https://imgur.com/a/ILRw6vm
>>> And sample book:
>>> https://archive.org/download/pistissophiaopu00petegoog
>>>
>>> We need to add Coptic to the languages supported by Tesseract but are not
>>> sure how.
>>> I tried following this document
>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
>>> but it's very difficult to understand.
>>>
>>> We need someone to help us with the initial setup so that we can dedicate
>>> our manpower to training the system.
>>> We are a non-profit group, so we are hoping for free help, but we would
>>> also consider paid help, since the alternative is hundreds of hours of
>>> manual labor to digitize just a few books.
>>>
>>> Thanks everyone for contributing to this awesome project

Re: [tesseract-ocr] Help for training tesseract to recognize a new (dead) language

2018-05-30 Thread ShreeDevi Kumar

> The current one uses Tesseract 3

Tesseract 3.0x uses different traineddata formats depending on the
version (3.02 vs. 3.04, etc.).

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Wed, May 30, 2018 at 4:14 PM, ShreeDevi Kumar 
wrote:

> I am running a test training for Coptic with Tesseract 4 and will let you
> know where to access the traineddata.
>
> You can train using UTF-8 text and Unicode Coptic fonts.
>
> 1. Collect UTF-8 text in Coptic.
> 2. Find Coptic Unicode fonts; if you can find one similar to the
> typewriter font used in the books, it will make training easier.
> 3. Train a model with these and then fine-tune it with line images and
> matching ground truth.
>
>
> ShreeDevi
> 
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Wed, May 30, 2018 at 4:09 PM, Ramast Magdy 
> wrote:
>
>> Thank you ShreeDevi for both moheb's link and the one below.
>> The current one uses Tesseract 3 and according to the author:
>> "Recognition quality of Coptic texts containing old fonts will be very
>> poor, depending on the trained data."
>>
>> I will get in contact with him to see if we can use the other link you
>> provided
>> https://github.com/OCR-D/ocrd-train
>> To train Tesseract 4.00
>>
>> Thank you very much
>>
>>
>> On 05/30/2018 06:31 AM, ShreeDevi Kumar wrote:
>>
>> See http://www.moheb.de/ocr.html
>>
>> It provides a traineddata file for Coptic for use with tesseract version
>> 3.
>>
>> ShreeDevi
>> 
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Tue, May 29, 2018 at 9:57 PM,  wrote:
>>
>>> Hi,
>>> I belong to a group who study an old Egyptian writing system called
>>> "Coptic".
>>> It's based mostly on Greek (with some variation).
>>>
>>> The great majority of books written in Coptic date from the last
>>> century and mostly use the same [typewriter] font.
>>> Here is a sample picture:
>>> https://imgur.com/a/ILRw6vm
>>> And a sample book:
>>> https://archive.org/download/pistissophiaopu00petegoog
>>>
>>> We need to add Coptic to the languages supported by Tesseract but are
>>> not sure how.
>>> I tried following this document:
>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
>>> but it is very difficult to understand.
>>>
>>> We need someone to help us with the initial setup so that we can
>>> dedicate our manpower to training the system.
>>> We are a non-profit group, so we are hoping for free help, but we would
>>> also consider paid help, since the alternative is hundreds of hours of
>>> manual labor to digitize just a few books.
>>>
>>> Thanks everyone for contributing to this awesome project
>>
>>
>


Re: [tesseract-ocr] Help for training tesseract to recognize a new (dead) language

2018-05-30 Thread ShreeDevi Kumar

I am running a test training for Coptic with Tesseract 4 and will let you
know where to access the traineddata.

You can train using UTF-8 text and Unicode Coptic fonts.

1. Collect UTF-8 text in Coptic.
2. Find Coptic Unicode fonts; if you can find one similar to the typewriter
font used in the books, it will make training easier.
3. Train a model with these and then fine-tune it with line images and
matching ground truth.
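
Step 1 can be sanity-checked before any training run. The following is
only a rough sketch: it assumes "Coptic in Unicode" means the Coptic block
(U+2C80 to U+2CFF) plus the Coptic letters in the Greek and Coptic block
(U+03E2 to U+03EF), and the function names are made up for illustration,
not part of Tesseract.

```python
def is_coptic(ch):
    """True for code points in the Unicode Coptic ranges (assumed ranges)."""
    cp = ord(ch)
    return 0x2C80 <= cp <= 0x2CFF or 0x03E2 <= cp <= 0x03EF


def coptic_ratio(text):
    """Share of non-whitespace characters that are Coptic code points."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(is_coptic(c) for c in chars) / len(chars)
```

A ratio near zero suggests the text is stored in a legacy font encoding
(Latin letters shown with a Coptic font) and would need converting first.
Punctuation and combining marks lower the ratio too, so treat it as a
rough indicator only.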


ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Wed, May 30, 2018 at 4:09 PM, Ramast Magdy  wrote:

> Thank you ShreeDevi for both moheb's link and the one below.
> The current one uses Tesseract 3 and according to the author:
> "Recognition quality of Coptic texts containing old fonts will be very
> poor, depending on the trained data."
>
> I will get in contact with him to see if we can use the other link you
> provided
> https://github.com/OCR-D/ocrd-train
> To train Tesseract 4.00
>
> Thank you very much
>
>
> On 05/30/2018 06:31 AM, ShreeDevi Kumar wrote:
>
> See http://www.moheb.de/ocr.html
>
> It provides a traineddata file for Coptic for use with tesseract version 3.
>
> ShreeDevi
> 
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Tue, May 29, 2018 at 9:57 PM,  wrote:
>
>> Hi,
>> I belong to a group who study an old Egyptian writing system called
>> "Coptic".
>> It's based mostly on Greek (with some variation).
>>
>> The great majority of books written in Coptic date from the last century
>> and mostly use the same [typewriter] font.
>> Here is a sample picture:
>> https://imgur.com/a/ILRw6vm
>> And a sample book:
>> https://archive.org/download/pistissophiaopu00petegoog
>>
>> We need to add Coptic to the languages supported by Tesseract but are
>> not sure how.
>> I tried following this document:
>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
>> but it is very difficult to understand.
>>
>> We need someone to help us with the initial setup so that we can dedicate
>> our manpower to training the system.
>> We are a non-profit group, so we are hoping for free help, but we would
>> also consider paid help, since the alternative is hundreds of hours of
>> manual labor to digitize just a few books.
>>
>> Thanks everyone for contributing to this awesome project
>
>
>


Re: [tesseract-ocr] Help for training tesseract to recognize a new (dead) language

2018-05-29 Thread ShreeDevi Kumar

See http://www.moheb.de/ocr.html

It provides a traineddata file for Coptic for use with tesseract version 3.

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, May 29, 2018 at 9:57 PM,  wrote:

> Hi,
> I belong to a group who study an old Egyptian writing system called
> "Coptic".
> It's based mostly on Greek (with some variation).
>
> The great majority of books written in Coptic date from the last century
> and mostly use the same [typewriter] font.
> Here is a sample picture:
> https://imgur.com/a/ILRw6vm
> And a sample book:
> https://archive.org/download/pistissophiaopu00petegoog
>
> We need to add Coptic to the languages supported by Tesseract but are not
> sure how.
> I tried following this document:
> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
> but it is very difficult to understand.
>
> We need someone to help us with the initial setup so that we can dedicate
> our manpower to training the system.
> We are a non-profit group, so we are hoping for free help, but we would
> also consider paid help, since the alternative is hundreds of hours of
> manual labor to digitize just a few books.
>
> Thanks everyone for contributing to this awesome project
>
>


Re: [tesseract-ocr] Help for training tesseract to recognize a new (dead) language

2018-05-29 Thread ShreeDevi Kumar

Please see https://github.com/OCR-D/ocrd-train

You can use it with image files and matching ground-truth text in UTF-8.
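
Before starting a run it may help to confirm that every image really has a
matching transcription. The sketch below assumes the pairing convention of
one line image `NAME.tif` next to one transcription `NAME.gt.txt`; please
check the ocrd-train README for the exact layout it expects. The function
name is hypothetical.

```python
from pathlib import Path


def check_ground_truth(folder):
    """Return (image name, problem) pairs for images lacking usable text."""
    problems = []
    for image in sorted(Path(folder).glob("*.tif")):
        # Assumed naming: line1.tif pairs with line1.gt.txt
        gt = image.with_suffix(".gt.txt")
        if not gt.exists():
            problems.append((image.name, "missing transcription"))
            continue
        try:
            text = gt.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            problems.append((image.name, "not valid UTF-8"))
            continue
        if not text.strip():
            problems.append((image.name, "empty transcription"))
    return problems
```

An empty list means every image found has a non-empty UTF-8 transcription
beside it.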

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, May 29, 2018 at 9:57 PM,  wrote:

> Hi,
> I belong to a group who study an old Egyptian writing system called
> "Coptic".
> It's based mostly on Greek (with some variation).
>
> The great majority of books written in Coptic date from the last century
> and mostly use the same [typewriter] font.
> Here is a sample picture:
> https://imgur.com/a/ILRw6vm
> And a sample book:
> https://archive.org/download/pistissophiaopu00petegoog
>
> We need to add Coptic to the languages supported by Tesseract but are not
> sure how.
> I tried following this document:
> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
> but it is very difficult to understand.
>
> We need someone to help us with the initial setup so that we can dedicate
> our manpower to training the system.
> We are a non-profit group, so we are hoping for free help, but we would
> also consider paid help, since the alternative is hundreds of hours of
> manual labor to digitize just a few books.
>
> Thanks everyone for contributing to this awesome project
>
>


Re: [tesseract-ocr] Some spaces are not recognized

2018-05-29 Thread ShreeDevi Kumar

Set the config variable preserve_interword_spaces to 1 for one run and to 0
for another, and see whether that makes any difference.
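
A minimal sketch of the two runs, assuming the tesseract binary is on the
PATH and that the trained model is installed under the language code "sin"
(substitute the name of your own traineddata). It uses the standard
`-c VAR=VALUE` command-line option; the helper names and output file names
are made up.

```python
import subprocess


def tesseract_command(image, outputbase, lang, preserve_spaces):
    """Build a tesseract call with preserve_interword_spaces set to 0 or 1."""
    value = 1 if preserve_spaces else 0
    return ["tesseract", image, outputbase, "-l", lang,
            "-c", f"preserve_interword_spaces={value}"]


def run_both(image, lang):
    """Run OCR twice; text lands in out_spaces0.txt and out_spaces1.txt."""
    for flag in (False, True):
        cmd = tesseract_command(image, f"out_spaces{int(flag)}", lang, flag)
        subprocess.run(cmd, check=True)
```

Calling run_both("page.png", "sin") and diffing the two text files shows
whether the setting changes the spacing at all.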

On Tue 29 May, 2018, 4:30 PM ShreeDevi Kumar,  wrote:

> >The traineddata from tesseract does not have a spacing problem,
>
> Then the problem is related to training.
>
>
>
>
> On Tue 29 May, 2018, 4:16 PM Sumedhe Dissanayake, <
> sumedhedissanay...@gmail.com> wrote:
>
>>
>>
>> On Friday, May 18, 2018 at 6:32:44 PM UTC+5:30, shree wrote:
>>>
>>> image is not visible.
>>>
>>> ShreeDevi
>>> 
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>
>>> On Fri, May 18, 2018 at 5:39 PM, Sumedhe Dissanayake <
>>> sumedhedi...@gmail.com> wrote:
>>>
>>>> Sometimes spaces between words are ignored when tesseract is used to
>>>> recognize Sinhala text.
>>>>
>>>> - The traineddata from tesseract does not have a spacing problem, even
>>>> though there were changes in tesseract since it was uploaded.
>>>> - The spacing problem occurs regardless of whether I start the training
>>>> from scratch or bootstrap with the traineddata from tesseract.
>>>> - The spacing problem gets worse with more training.
>>>> - Adding more space between the words during training does not make a
>>>> difference.
>>>> - Adding double space between the words during recognition solves the
>>>> problem.
>>>> - The spacing problem is not consistent, i.e. in the recognition of a
>>>> text only some of the inter-word spaces are ignored (could not figure out
>>>> any logic as to when it happens).
>>>>
>>>> I have attached a screenshot, comparing a sample of input and output
>>>> text.
>>>>
>>>> Words missing spaces are underlined.
>>>>
>>>>
>>>> <https://lh3.googleusercontent.com/-T6hAiA4VclA/Wv1HEKkrioI/IN4/hZors3-ZJq01n24E3_c_JFzhws90X-x9gCLcBGAs/s1600/Screenshot_20180517_143558.png>
>>>>
>>
>


Re: [tesseract-ocr] Some spaces are not recognized

2018-05-29 Thread ShreeDevi Kumar

>The traineddata from tesseract does not have a spacing problem,

Then the problem is related to training.




On Tue 29 May, 2018, 4:16 PM Sumedhe Dissanayake, <
sumedhedissanay...@gmail.com> wrote:

>
>
> On Friday, May 18, 2018 at 6:32:44 PM UTC+5:30, shree wrote:
>>
>> image is not visible.
>>
>> ShreeDevi
>> 
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Fri, May 18, 2018 at 5:39 PM, Sumedhe Dissanayake <
>> sumedhedi...@gmail.com> wrote:
>>
>>> Sometimes spaces between words are ignored when tesseract is used to
>>> recognize Sinhala text.
>>>
>>> - The traineddata from tesseract does not have a spacing problem, even
>>> though there were changes in tesseract since it was uploaded.
>>> - The spacing problem occurs regardless of whether I start the training
>>> from scratch or bootstrap with the traineddata from tesseract.
>>> - The spacing problem gets worse with more training.
>>> - Adding more space between the words during training does not make a
>>> difference.
>>> - Adding double space between the words during recognition solves the
>>> problem.
>>> - The spacing problem is not consistent, i.e. in the recognition of a
>>> text only some of the inter-word spaces are ignored (could not figure out
>>> any logic as to when it happens).
>>>
>>> I have attached a screenshot, comparing a sample of input and output
>>> text.
>>>
>>> Words missing spaces are underlined.
>>>
>>>
>>> 
>>>
>


Re: [tesseract-ocr] Re: use multi threads in tesseract

2018-05-28 Thread ShreeDevi Kumar

Also see https://github.com/tesseract-ocr/tesseract/issues/1317

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Mon, May 28, 2018 at 2:45 PM, ShreeDevi Kumar <shreesh...@gmail.com>
wrote:

> Please see
> https://github.com/tesseract-ocr/tesseract/wiki/FAQ#can-i-increase-speed-of-ocr
>
> Set the maximum number of threads using the environment variable
> OMP_THREAD_LIMIT.
>
> ShreeDevi
> 
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Mon, May 28, 2018 at 2:35 PM, Jakob Salomonsson <
> jakob.salomons...@gmail.com> wrote:
>
>> Hello,
>>
>> I'm having roughly the same problem, but with pytesseract (maybe the
>> same answer applies to both).
>> I have tried several things, such as setting OMP_THREAD_LIMIT=4 when
>> calling the pytesseract function, or adding "OMP_THREAD_LIMIT 4" to one
>> or several of the config files.
>>
>> But still no change. Maybe I'm just setting it in the wrong way.
>>
>>
>> Does anyone know how to help us make progress on this? It would be a
>> great help.
>>
>>
>>
>>
>> On Monday, May 28, 2018 at 09:27:00 UTC+2, nick wrote:
>>>
>>> I found OMP_THREAD_LIMIT, but I don't know how to change it to 20?!
>>>
>>
>
>


Re: [tesseract-ocr] Re: use multi threads in tesseract

2018-05-28 Thread ShreeDevi Kumar

Please see
https://github.com/tesseract-ocr/tesseract/wiki/FAQ#can-i-increase-speed-of-ocr

Set the maximum number of threads using the environment variable
OMP_THREAD_LIMIT.
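
OMP_THREAD_LIMIT is an OpenMP environment variable, not a Tesseract config
variable, so it belongs in the environment of the process that runs
tesseract rather than in a config file. A minimal Python sketch, assuming
the tesseract binary is on the PATH; the helper names are made up.

```python
import os
import subprocess


def env_with_thread_limit(limit, base=None):
    """Copy an environment and set OMP_THREAD_LIMIT in the copy."""
    env = dict(os.environ if base is None else base)
    env["OMP_THREAD_LIMIT"] = str(limit)
    return env


def run_tesseract(image, outputbase, limit):
    """Run the tesseract binary with the thread limit in its environment."""
    subprocess.run(["tesseract", image, outputbase],
                   env=env_with_thread_limit(limit), check=True)
```

For pytesseract, which as far as I know starts the tesseract binary as a
child process, setting os.environ["OMP_THREAD_LIMIT"] = "4" before the call
should have the same effect. From a shell, exporting the variable before
running tesseract does the same thing.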

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Mon, May 28, 2018 at 2:35 PM, Jakob Salomonsson <
jakob.salomons...@gmail.com> wrote:

> Hello,
>
> I'm having roughly the same problem, but with pytesseract (maybe the
> same answer applies to both).
> I have tried several things, such as setting OMP_THREAD_LIMIT=4 when
> calling the pytesseract function, or adding "OMP_THREAD_LIMIT 4" to one
> or several of the config files.
>
> But still no change. Maybe I'm just setting it in the wrong way.
>
>
> Does anyone know how to help us make progress on this? It would be a
> great help.
>
>
>
>
> On Monday, May 28, 2018 at 09:27:00 UTC+2, nick wrote:
>>
>> I found OMP_THREAD_LIMIT, but I don't know how to change it to 20?!
>>
>


Re: [tesseract-ocr] Re: how to install this

2018-05-24 Thread ShreeDevi Kumar

On Thu, May 24, 2018 at 6:41 PM, Hiren Motwani 
wrote:

> Thank you so much. Can you guide me on how to use it?
>

https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage


If you want a GUI, try

https://github.com/manisandro/gImageReader/releases



>
>
> On Thursday, May 24, 2018 at 6:18:53 PM UTC+5:30, Hiren Motwani wrote:
>>
>> How do I install tesseract-ocr on Windows 10?
>>
>


Re: [tesseract-ocr] how to install this

2018-05-24 Thread ShreeDevi Kumar

https://github.com/UB-Mannheim/tesseract/wiki

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Thu, May 24, 2018 at 6:10 PM, Hiren Motwani 
wrote:

> How do I install tesseract-ocr on Windows 10?
>
>


Re: [tesseract-ocr] Tesseract doesnt read tiff files correctly

2018-05-23 Thread ShreeDevi Kumar

Tesseract uses Leptonica. You can try it for preprocessing. See an example
at

http://tpgit.github.io/UnOfficialLeptDocs/leptonica/line-removal.html
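
To give a rough idea of what line removal does on a receipt or invoice,
here is a toy sketch in plain Python. It is not the Leptonica code from the
linked page: it assumes the image is already binarized into rows of 0
(background) and 1 (ink) and simply clears long horizontal runs of ink. The
function name and the min_run parameter are made up.

```python
def remove_horizontal_lines(image, min_run):
    """Clear horizontal runs of ink (1s) that are at least min_run long."""
    cleaned = [row[:] for row in image]  # leave the input untouched
    for row in cleaned:
        start = None
        for x in range(len(row) + 1):
            ink = x < len(row) and row[x] == 1
            if ink and start is None:
                start = x  # a run of ink begins here
            elif not ink and start is not None:
                if x - start >= min_run:
                    row[start:x] = [0] * (x - start)
                start = None
    return cleaned
```

Choose min_run wider than the widest character stroke so that only ruling
lines are erased. Real scans are rarely perfectly level, which is why a
proper preprocessing library is the better choice for production work.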

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Wed, May 23, 2018 at 1:02 PM, Tomas Mikalauskas 
wrote:

>
> I am trying to read some TIFF files that are receipts, invoices and so on.
> The text is in Lithuanian. I've used ImageMagick to remove the background
> and make the text stand out more, but I still get pretty bad output.
> Here are some examples of my files.
>
>
> 
> 
>
>


Re: [tesseract-ocr] missing a line in OCR persian

2018-05-21 Thread ShreeDevi Kumar

This seems related to the open issue
https://github.com/tesseract-ocr/tesseract/issues/1339
("Entire lines of text missing. Different missing when psm = 3, 6, 11").

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

2018-05-22 10:45 GMT+05:30 reza :

> I used Tesseract 4 beta for OCR, but the results had some missing words
> (line 2 was missed).
> I attached the PNG and the results.
>
>
> 
>
>
>> می‌شود آسانتر است از زبانهایی مثل فارسی و عربی که حروف یک کلمه به یکدیگر
>> می‌چسبند. این موضوع به
>> باشند. البته در سالهای اخیر تلاش‌های قابل تقدیری از سوی برخی شرکتهای فعال
>> در زمینه پردازش تصویر انجام
>>
>> شده که برخی از آنها منجر به محصولات قابل قبولی شده‌است
>>

Re: [tesseract-ocr] Re: Training Tesseract4.0 (LSTM) on word level bounding boxes

2018-05-21 Thread ShreeDevi Kumar

You can see if generate_line_box.py from
https://github.com/OCR-D/ocrd-train is helpful.

It requires single line images and matching ground truth to create the box
files.
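
As a rough illustration of what such a line-level box file looks like, here
is a minimal Python sketch. It is not the OCR-D script itself: the function
names are hypothetical, and the layout (every character of the ground truth
gets the box of the whole line image, followed by a tab entry that marks the
end of the line) is an assumption based on my reading of the Tesseract 4 box
format. In practice the width and height would come from the line image.

```python
import unicodedata


def split_clusters(text):
    """Group each combining mark with the character it follows."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def make_line_box(text, width, height):
    """Return box-file content for one single-line image (assumed layout)."""
    rows = []
    for cluster in split_clusters(text.strip()):
        # Every symbol shares the box of the full text line.
        rows.append(f"{cluster} 0 0 {width} {height} 0")
    # Assumed end-of-line marker: a tab with a tiny box past the line.
    rows.append(f"\t {width} {height} {width + 1} {height + 1} 0")
    return "\n".join(rows) + "\n"


if __name__ == "__main__":
    print(make_line_box("ab c", 100, 20), end="")
```

Note that the space between words gets its own box entry, which matters for
getting spaces in the output after training.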

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, May 22, 2018 at 8:14 AM, Tao Shatoo  wrote:

> Not yet. I tried but failed. I'm waiting for the same API as you.
>
> On Friday, August 11, 2017 at 6:08:05 AM UTC+8, Shoaib wrote:
>>
>> Hi everyone,
>>
>> I would like to train Tesseract on my own dataset of word images. I have
>> the bounding box information, but for the whole word instead of per
>> character. I referred to the following documentation on training
>> Tesseract 4.0:
>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
>>
>> The documentation says "*The boxes only need to be at the textline
>> level. It is thus far easier to make training data from existing image
>> data.*". But later in the wiki, the box format that allows boxes at text
>> line level is said not to be implemented yet ("*Box File Format - Second
>> Option (NOT YET IMPLEMENTED)*"). I would therefore like to know if there
>> is any way to train Tesseract based on just the word bounding box
>> information instead of character level information.
>>
>> Thanking you for your time in this regard.
>>

Re: [tesseract-ocr] run training and testing on gpu

2018-05-19 Thread ShreeDevi Kumar

Regarding LSTM training, please see
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00

> Basically it will still run on anything with enough memory, but the
higher-end your processor is, the faster it will go. No *GPU* is needed.
(No support.)



ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sat, May 19, 2018 at 9:51 AM, john  wrote:

> Hi all, how can I run Tesseract 4 (LSTM version) on a GPU? Is it possible?
>

Re: [tesseract-ocr] train more fonts on trained model fas in tesseract

2018-05-18 Thread ShreeDevi Kumar

Hi Reza,

Attached are two scripts and one log file. You will need to change the
directories in the scripts.

finetune.sh and finetune log file are for a sample finetuning for eng. By
changing the language code you can run it for fas.
You can use that as a test.

plus-fas.sh is for plusminus type of finetuning for fas. It merges the
existing unicharset with the unicharset extracted from the training_text.

You will need to update the training_text file in langdata/fas.
Optionally, you can also review and update the wordlist, numbers and punc
files.

The scripts should run if you give correct directory names.
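
For orientation, here is a minimal Python sketch of the lstmtraining calls
that this kind of finetuning ends in. It is not the attached scripts: the
helper names and all file paths are hypothetical. The flags follow the
TrainingTesseract-4.00 wiki; plusminus finetuning differs only in passing
the merged traineddata plus --old_traineddata.

```python
def extract_lstm_command(traineddata, lstm_out):
    """Command that extracts the LSTM model from an existing traineddata."""
    return ["combine_tessdata", "-e", traineddata, lstm_out]


def finetune_command(lstm_model, output_prefix, list_file, base_traineddata,
                     max_iterations=400, merged_traineddata=None):
    """Build the lstmtraining argument list (the command is not executed)."""
    cmd = [
        "lstmtraining",
        "--continue_from", lstm_model,
        "--model_output", output_prefix,
        "--train_listfile", list_file,
        "--max_iterations", str(max_iterations),
    ]
    if merged_traineddata is None:
        # Plain finetuning: the character set stays the same.
        cmd += ["--traineddata", base_traineddata]
    else:
        # Plusminus: new unicharset, old traineddata given for the mapping.
        cmd += ["--traineddata", merged_traineddata,
                "--old_traineddata", base_traineddata]
    return cmd


if __name__ == "__main__":
    # Hypothetical paths; adjust to your own directories.
    print(" ".join(extract_lstm_command(
        "../tessdata_best/fas.traineddata", "./fas.lstm")))
    print(" ".join(finetune_command(
        "./fas.lstm", "./plus_fas/plus", "./train/fas.training_files.txt",
        "../tessdata_best/fas.traineddata", max_iterations=3600,
        merged_traineddata="./train/fas/fas.traineddata")))
```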

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sat, May 19, 2018 at 9:24 AM, reza  wrote:

> hi ShreeDevi
>
> Thanks.
>
> I tested the 2 models that you provided. The accuracy on samples without
> noise was about 98%, but on scanned samples or captured images it was
> about 80%.
> It still didn't work on different fonts, though.
> Could you send all the files that are needed for training models? I want
> to fine tune the model with more fonts and diacritics.
>
> best regards
>
>
> On Friday, May 18, 2018 at 8:49:54 PM UTC+4:30, shree wrote:
>>
>> I have posted a couple of test models for Farsi at
>> https://github.com/Shreeshrii/tessdata_shreetest
>>
>> These have not been trained on text with diacritics as the normalization
>> and training process was giving error on the combining marks.
>>
>> Please give them a try and see if they provide better recognition for
>> numbers and text without combining marks.
>>
>> FYI, I do not know the Persian language so it is difficult for me to
>> gauge if results are ok or not.
>>
>> ShreeDevi
>>

ubuntu@tesseract-ocr:~/tess4training$ bash -x ./tesstrain_finetune.sh
+ MakeTraining=yes
+ MakeEval=yes
+ RunTraining=yes
+ Lang=eng
+ Continue_from_lang=eng
+ bestdata_dir=../tessdata_best
+ tessdata_dir=../tessdata
+ tesstrain_dir=../tesseract/src/training
+ langdata_dir=../langdata
+ fonts_dir=../.fonts
+ fonts_for_training='   '\''FreeSerif'\'' '
+ fonts_for_eval='   '\''Arial'\'' '
+ train_output_dir=./finetune_train_eng
+ eval_output_dir=./finetune_eval_eng
+ trained_output_dir=./finetune_trained_eng-from-eng
+ '[' yes = yes ']'
+ echo '## MAKING TRAINING DATA ##'
## MAKING TRAINING DATA ##
+ rm -rf ./finetune_train_eng
+ mkdir ./finetune_train_eng
+ echo ' run tesstrain.sh '
 run tesstrain.sh 
+ eval bash ../tesseract/src/training/tesstrain.sh --lang eng --linedata_only --noextract_font_properties --exposures 0 --fonts_dir ../.fonts --fontlist ''\''FreeSerif'\''' --langdata_dir ../langdata --tessdata_dir ../tessdata --training_text ../langdata/eng/eng.training_text --output_dir ./finetune_train_eng
++ bash ../tesseract/src/training/tesstrain.sh --lang eng --linedata_only --noextract_font_properties --exposures 0 --fonts_dir ../.fonts --fontlist FreeSerif --langdata_dir ../langdata --tessdata_dir ../tessdata --training_text ../langdata/eng/eng.training_text --output_dir ./finetune_train_eng

=== Starting training for language 'eng'
[Sat May 19 04:20:00 UTC 2018] /usr/local/bin/text2image --fonts_dir=../.fonts --font=FreeSerif

Re: [tesseract-ocr] Re: How can JTessBoxEditor generate lstm files ?

2018-05-18 Thread ShreeDevi Kumar

I use WSL with MobaXterm on Windows 10.

On Fri 18 May, 2018, 11:33 PM Joshua Willmot, 
wrote:

> I am using Windows Subsystem for Linux (Ubuntu). It works in exactly the
> same way as it would on normal Ubuntu.
>
> On Thursday, May 17, 2018 at 11:11:54 PM UTC+2, Quan Nguyen wrote:
>>
>> Those .sh shell scripts would not run on Windows environment. You may
>> need Cygwin or Windows Subsystem for Linux. Hope others who have experience
>> on this will chime in.
>>
>> On Thursday, May 17, 2018 at 2:35:50 AM UTC-5, Fadi Fawzi wrote:
>>>
>>> Thanks  Quan
>>> But is there a simple way to do training  process on WINDOWS, or I must
>>> adhere to Linux (Ubuntu) ?
>>>
>>> On Tue, May 15, 2018 at 5:02 AM, Quan Nguyen  wrote:
>>>
>>>> As of today, it supports only legacy training (i.e., 3.0x version).
>>>>
>>>> Training for 4.0x is described in the Training Wiki.
>>>>
>>>> On Saturday, May 12, 2018 at 6:40:27 AM UTC-5, fadif...@gmail.com
>>>> wrote:
>>>>>
>>>>> I am trying to add a few new characters to the Arabic character set
>>>>> and train for them by fine tuning using jTessBoxEditor v2 beta.
>>>>>
>>>>> The box/tiff pairs are generated successfully, but when I apply the
>>>>> executable trainer a .tr file and ara.traineddata are generated instead of
>>>>> .lstm file. According to docs, a lstm file should be generated in order to
>>>>> start lstmtraining. Please tell me where I am going wrong.
>>>>>

Re: [tesseract-ocr] Re: Error in executing new .traineddata file

2018-05-18 Thread ShreeDevi Kumar

>Tesseract Beta 4.00,  and do the same copy the .traineddata inside
tessdata,

If you have created your traineddata for 3.05, it may not be compatible
with 4.0.0beta.
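
One way to see what a traineddata file can be used for is to unpack it with
combine_tessdata -u and look at the components. The Python sketch below is
only an illustration under assumptions: the function name is hypothetical,
and it takes the presence of an inttemp component as the sign of a legacy
model and an lstm component as the sign of an LSTM model. A file trained
with 3.05 would normally have only the legacy part, so in 4.0 it could only
be tried with the legacy engine (--oem 0).

```python
def usable_oem_modes(component_files):
    """Guess usable --oem values from unpacked component file names.

    component_files: names such as "osa.inttemp" or "eng.lstm-unicharset",
    as produced by unpacking a traineddata file.
    """
    suffixes = {name.split(".", 1)[1] for name in component_files if "." in name}
    has_legacy = "inttemp" in suffixes   # assumed legacy marker
    has_lstm = "lstm" in suffixes        # assumed LSTM marker
    modes = []
    if has_legacy:
        modes.append(0)  # legacy engine only
    if has_lstm:
        modes.append(1)  # LSTM engine only
    if has_legacy and has_lstm:
        modes.append(2)  # legacy + LSTM combined
    return modes


if __name__ == "__main__":
    print(usable_oem_modes(["osa.unicharset", "osa.inttemp", "osa.normproto"]))
```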

On Sat 19 May, 2018, 2:26 AM Quan Nguyen,  wrote:

> The error message indicated Tesseract was looking for osa.traineddata file
> under C:\Program Files (x86)\Tesseract-OCR folder. You need to correctly
> specify the path to tessdata folder. Your oem value seems to be incorrect
> too.
>
> Run at the command prompt for full instructions:
>
> tesseract.exe --help-extra
>
> On Friday, May 18, 2018 at 4:52:57 AM UTC-5, Eman Sawalha wrote:
>>
>>
>> 
>>
>>
>> Thank you for your response, Quan Nguyen. I downloaded Tesseract Beta
>> 4.00 and did the same: copied the .traineddata into tessdata, then added
>> the path of Tesseract to the system environment variable. And I got this
>> new error :(.
>>
>>
>>
>>
>> On Wednesday, May 16, 2018 at 11:49:03 PM UTC+3, Quan Nguyen wrote:
>>>
>>> Sounds like you've trained using Tesseract 3.05, so it could run with
>>> Tesseract of that version or newer and is not backward compatible with
>>> older version 3.02.
>>>



Re: [tesseract-ocr] train more fonts on trained model fas in tesseract

2018-05-18 Thread ShreeDevi Kumar

I have posted a couple of test models for Farsi at
https://github.com/Shreeshrii/tessdata_shreetest

These have not been trained on text with diacritics as the normalization
and training process was giving error on the combining marks.

Please give them a try and see if they provide better recognition for
numbers and text without combining marks.

FYI, I do not know the Persian language so it is difficult for me to gauge
if results are ok or not.

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, May 15, 2018 at 6:47 PM, reza  wrote:

> Hi again,
> thanks for your reply.
>
> I need more fonts, for example:
> B Koodak
> B Lotus
> B Titr
> B Zar
> B Yekan
> Iran Nastaliq
>
> If needed, should I send the .ttf files of those fonts?
>
> thanks
>
>
> On Tuesday, May 15, 2018 at 5:35:10 PM UTC+4:30, shree wrote:
>>
>> I will try to put together complete steps.
>>
>> I am doing a test run for training persian.
>>
>> Are the following fonts ok for it?
>>
>>   '55_Sarchia_Kurdish' \
>>   '56_Sarchia_Kurdish_Bold Bold' \
>>   'Amiri' \
>>   'Arabic Typesetting' \
>>   'Arial' \
>>   'Arial Unicode MS' \
>>   'B Nazanin' \
>>   'B Nazanin Bold' \
>>   'Calibri' \
>>   'Courier New' \
>>   'Microsoft Sans Serif' \
>>   'Scheherazade' \
>>   'Tahoma' \
>>   'Times New Roman,' \
>>   'Traditional Arabic' \
>>
>> ShreeDevi
>> 
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Tue, May 15, 2018 at 3:59 PM, reza  wrote:
>>
>>> I tested it on Ubuntu; that raised an error too.
>>>
>>> Could you help me and send me a new bash file for fine tuning with new
>>> fonts?
>>>
>>> I put the "eng.traineddata" file in the tessdata_best folder
>>> and "eng.training_text" and "eng.traineddata" in langdata\eng.
>>>
>>> Is that correct and sufficient, or are more files needed?
>>>
>>>
>>> thanks
>>>

Re: [tesseract-ocr] Some spaces are not recognized

2018-05-18 Thread ShreeDevi Kumar

The image is not visible.

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Fri, May 18, 2018 at 5:39 PM, Sumedhe Dissanayake <
sumedhedissanay...@gmail.com> wrote:

> Sometimes spaces between words are ignored when tesseract is used to
> recognize Sinhala text.
>
> - The traineddata from tesseract does not have a spacing problem, even
> though there were changes in tesseract since it was uploaded.
> - The spacing problem occurs regardless of whether I start the training
> from scratch or bootstrap with the traineddata from tesseract.
> - The spacing problem gets worse with more training.
> - Adding more space between the words during training does not make a
> difference.
> - Adding double space between the words during recognition solves the
> problem.
> - The spacing problem is not consistent, i.e. in the recognition of a text
> only some of the inter-word spaces are ignored (could not figure out any
> logic as to when it happens).
>
> I have attached a screenshot, comparing a sample of input and output text.
>
> Words missing spaces are underlined.
>
>
> 
>

Re: [tesseract-ocr] tesseract version - Ubuntu 16.04 PPA vs compiling from tesseract-ocr github source (master-branch)

2018-05-17 Thread ShreeDevi Kumar

>  Which traineddata (english) is installed when tesseract is installed
using the Ubuntu PPA

tessdata_fast

>   Is the Ubuntu PPA version in sync with the Github master branch?

Not necessarily, but it should be pretty close. You can look at the commit
number and date in the files at the PPA.

>  Which traineddata (english) produces most accurate results among the
three i

It depends on your requirement and the kind of images you are using.

If you need the legacy model, then you have to use tessdata.
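
As a rough guide to that choice, here is a small Python sketch. The function
and its parameters are hypothetical, and it encodes only my understanding of
the three repositories: tessdata carries the legacy model alongside an LSTM
model, tessdata_best holds the float LSTM models that finetuning starts
from, and tessdata_fast holds smaller integer LSTM models. Which one is most
accurate on your images still has to be tested.

```python
def pick_tessdata_repos(need_legacy_engine=False, plan_to_finetune=False,
                        prefer_speed=False):
    """Return the traineddata repositories worth downloading (a heuristic)."""
    repos = []
    if need_legacy_engine:
        # Only tessdata includes the legacy (--oem 0) model.
        repos.append("tessdata")
    if plan_to_finetune:
        # LSTM finetuning continues from the float models.
        repos.append("tessdata_best")
    if not repos:
        repos.append("tessdata_fast" if prefer_speed else "tessdata_best")
    return repos


if __name__ == "__main__":
    print(pick_tessdata_repos(need_legacy_engine=True))
    print(pick_tessdata_repos(prefer_speed=True))
```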

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Thu, May 17, 2018 at 6:51 PM, Pushkar Pandey 
wrote:

> Hi All,
>
> Could someone answer the following questions I have?
> 1. Is the Ubuntu 16.04 PPA the latest tesseract version right from the
> GitHub master branch? Is the Ubuntu PPA version in sync with the Github
> master branch?
> 2. Which traineddata (english) is installed when tesseract is installed
> using the Ubuntu PPA. Is it the *tessdata_best *or *tessdata_fast* or the
> default *tessdata.*
> 3. Which traineddata (english) produces most accurate results among the
> three in your experience (*tessdata_best *or *tessdata_fast* or the
> default *tessdata*).
>
> I ask this because I see a few differences between the OCR output of
> Ubuntu PPA version and the compiled version of tesseract-ocr source. It
> could be due to different traineddata being used in the two cases. Not sure
> though.
>
> Thanks,
> Pushkar
>

Re: [tesseract-ocr] train more fonts on trained model fas in tesseract

2018-05-15 Thread ShreeDevi Kumar

I will try to put together complete steps.

I am doing a test run for training persian.

Are the following fonts ok for it?

  '55_Sarchia_Kurdish' \
  '56_Sarchia_Kurdish_Bold Bold' \
  'Amiri' \
  'Arabic Typesetting' \
  'Arial' \
  'Arial Unicode MS' \
  'B Nazanin' \
  'B Nazanin Bold' \
  'Calibri' \
  'Courier New' \
  'Microsoft Sans Serif' \
  'Scheherazade' \
  'Tahoma' \
  'Times New Roman,' \
  'Traditional Arabic' \
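
Before a training run it can help to check that every name in such a list
is actually known to text2image. The Python sketch below is an assumption-
laden illustration: the helper names are hypothetical, and it assumes the
output of "text2image --list_available_fonts --fonts_dir=DIR" has one font
per line in the form "  0: Font Name". Capture that output yourself and pass
it in as a string.

```python
import re


def parse_font_listing(listing):
    """Collect font names from lines that look like '  0: Font Name'."""
    fonts = set()
    for line in listing.splitlines():
        match = re.match(r"\s*\d+:\s*(.+?)\s*$", line)
        if match:
            fonts.add(match.group(1))
    return fonts


def missing_fonts(wanted, listing):
    """Return the wanted font names that do not appear in the listing."""
    available = parse_font_listing(listing)
    return [name for name in wanted if name not in available]


if __name__ == "__main__":
    sample = "  0: Amiri\n  1: B Nazanin\n  2: B Nazanin Bold\n"
    print(missing_fonts(["Amiri", "B Nazanin Bold", "B Titr"], sample))
```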

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, May 15, 2018 at 3:59 PM, reza  wrote:

> I tested it on Ubuntu; that raised an error too.
>
> Could you help me and send me a new bash file for fine tuning with new
> fonts?
>
> I put the "eng.traineddata" file in the tessdata_best folder
> and "eng.training_text" and "eng.traineddata" in langdata\eng.
>
> Is that correct and sufficient, or are more files needed?
>
>
> thanks
>

Re: [tesseract-ocr] train more fonts on trained model fas in tesseract

2018-05-15 Thread ShreeDevi Kumar

Please use the latest Windows binaries from
https://github.com/UB-Mannheim/tesseract/wiki provided by @stweil.

How do you run the bash script on Windows 10?

@stweil I have not tried training on Windows. Do you have feedback from
others who have tried it?

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, May 15, 2018 at 2:41 PM, reza  wrote:

> windows 10
> tesseract 4 alpha
>
>
> On Tuesday, May 15, 2018 at 1:12:20 PM UTC+4:30, shree wrote:
>>
>> What o/s are you running it on?
>>
>> Which version of tesseract?
>>
>> > ICU ERROR: U_FILE_ACCESS_ERRORERROR: /tmp/tmp.6m4B2TUln1/eng/eng.unicharset
>> does not exist or is not readable
>>
>> which version of icu library?
>>
>> ShreeDevi
>> 
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Tue, May 15, 2018 at 1:00 PM, reza  wrote:
>>
>>> I used the attached finetune.sh file, but it raised an error. Could you
>>> help me?
>>>
>>> thanks
>>>
>>>
>>>> ## MAKING TRAINING DATA ##
>>>>
>>>> === Starting training for language 'eng'
>>>>
>>>> [Tue, May 15, 2018 11:42:36 AM] /c/Program Files (x86)/Tesseract-OCR/text2image --fonts_dir=C:WindowsFonts --font=Arial --outputbase=/tmp/font_tmp.CpgpM0lbxD/sample_text.txt --text=/tmp/font_tmp.CpgpM0lbxD/sample_text.txt --fontconfig_tmpdir=/tmp/font_tmp.CpgpM0lbxD
>>>>
>>>> Rendered page 0 to file C:/Users/asus/AppData/Local/Temp/font_tmp.CpgpM0lbxD/sample_text.txt.tif
>>>>
>>>> === Phase I: Generating training images ===
>>>>
>>>> Rendering using Arial
>>>>
>>>> Rendering using Corbel
>>>>
>>>> [Tue, May 15, 2018 11:42:37 AM] /c/Program Files (x86)/Tesseract-OCR/text2image --fontconfig_tmpdir=/tmp/font_tmp.CpgpM0lbxD --fonts_dir=C:WindowsFonts --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.6m4B2TUln1/eng/eng.Arial.exp0 --max_pages=3 --font=Arial --text=./langdata/eng/eng.training_text
>>>>
>>>> [Tue, May 15, 2018 11:42:37 AM] /c/Program Files (x86)/Tesseract-OCR/text2image --fontconfig_tmpdir=/tmp/font_tmp.CpgpM0lbxD --fonts_dir=C:WindowsFonts --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.6m4B2TUln1/eng/eng.Corbel.exp0 --max_pages=3 --font=Corbel --text=./langdata/eng/eng.training_text
>>>>
>>>> Stripped 2 unrenderable words
>>>>
>>>> Rendered page 0 to file C:/Users/asus/AppData/Local/Temp/tmp.6m4B2TUln1/eng/eng.Arial.exp0.tif
>>>>
>>>> Stripped 1 unrenderable words
>>>>
>>>> Rendered page 1 to file C:/Users/asus/AppData/Local/Temp/tmp.6m4B2TUln1/eng/eng.Arial.exp0.tif
>>>>
>>>> Stripped 2 unrenderable words
>>>>
>>>> Rendered page 0 to file C:/Users/asus/AppData/Local/Temp/tmp.6m4B2TUln1/eng/eng.Corbel.exp0.tif
>>>>
>>>> Stripped 1 unrenderable words
>>>>
>>>> Rendered page 1 to file C:/Users/asus/AppData/Local/Temp/tmp.6m4B2TUln1/eng/eng.Corbel.exp0.tif
>>>>
>>>> === Phase UP: Generating unicharset and unichar properties files ===
>>>>
>>>> [Tue, May 15, 2018 11:42:39 AM] /c/Program Files (x86)/Tesseract-OCR/unicharset_extractor --output_unicharset /tmp/tmp.6m4B2TUln1/eng/eng.unicharset --norm_mode 1 /tmp/tmp.6m4B2TUln1/eng/eng.Arial.exp0.box /tmp/tmp.6m4B2TUln1/eng/eng.Corbel.exp0.box
>>>>
>>>> Extracting unicharset from box file C:/Users/asus/AppData/Local/Temp/tmp.6m4B2TUln1/eng/eng.Arial.exp0.box
>>>>
>>>> Extracting unicharset from box file C:/Users/asus/AppData/Local/Temp/tmp.6m4B2TUln1/eng/eng.Corbel.exp0.box
>>>>
>>>> ICU ERROR: U_FILE_ACCESS_ERRORERROR: /tmp/tmp.6m4B2TUln1/eng/eng.unicharset does not exist or is not readable
>>>>
>>>> ## MAKING EVAL DATA ##
>>>>
>>>> === Starting training for language 'eng'
>>>>
>>>> [Tue, May 15, 2018 11:42:40 AM] /c/Program Files (x86)/Tesseract-OCR/text2image --fonts_dir=C:WindowsFonts --font=Calibri --outputbase=/tmp/font_tmp.n0qq4iJk4q/sample_text.txt --text=/tmp/font_tmp.n0qq4iJk4q/sample_text.txt --fontconfig_tmpdir=/tmp/font_tmp.n0qq4iJk4q
>>>>
>>>> Rendered page 0 to file C:/Users/asus/AppData/Local/Temp/font_tmp.n0qq4iJk4q/sample_text.txt.tif
>>>>
>>>> === Phase I: Generating training images ===
>>>>
>>>> Rendering using Calibri
>>>>
>>>> [Tue, May 15, 2018 11:42:40 AM] /c/Program Files (x86)/Tesseract-OCR/text2image --fontconfig_tmpdir=/tmp/font_tmp.n0qq4iJk4q --fonts_dir=C:WindowsFonts --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.h0l64TAxEq/eng/eng.Calibri.exp0 --max_pages=3 --font=Calibri --text=./langdata/eng/eng.training_text
>>>>
>>>> Stripped 2 unrenderable words
>>>>
>>>> Rendered page 0 to file C:/Users/asus/AppData/Local/Te

Re: [tesseract-ocr] train more fonts on trained model fas in tesseract

2018-05-15 Thread ShreeDevi Kumar

What o/s are you running it on?

Which version of tesseract?

> ICU ERROR: U_FILE_ACCESS_ERRORERROR: /tmp/tmp.6m4B2TUln1/eng/eng.unicharset
> does not exist or is not readable

Which version of the ICU library?

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, May 15, 2018 at 1:00 PM, reza  wrote:

> i used this attached finetune.sh file ... but that raised error. could u
> help me ?
>
> thanks
>
>
>> ## MAKING TRAINING DATA ##
>>
>>
>>> === Starting training for language 'eng'
>>
>> [Tue, May 15, 2018 11:42:36 AM] /c/Program Files
>>> (x86)/Tesseract-OCR/text2image --fonts_dir=C:WindowsFonts --font=Arial
>>> --outputbase=/tmp/font_tmp.CpgpM0lbxD/sample_text.txt
>>> --text=/tmp/font_tmp.CpgpM0lbxD/sample_text.txt
>>> --fontconfig_tmpdir=/tmp/font_tmp.CpgpM0lbxD
>>
>> Rendered page 0 to file C:/Users/asus/AppData/Local/
>>> Temp/font_tmp.CpgpM0lbxD/sample_text.txt.tif
>>
>>
>>> === Phase I: Generating training images ===
>>
>> Rendering using Arial
>>
>> Rendering using Corbel
>>
>> [Tue, May 15, 2018 11:42:37 AM] /c/Program Files
>>> (x86)/Tesseract-OCR/text2image --fontconfig_tmpdir=/tmp/font_tmp.CpgpM0lbxD
>>> --fonts_dir=C:WindowsFonts --strip_unrenderable_words --leading=32
>>> --char_spacing=0.0 --exposure=0 
>>> --outputbase=/tmp/tmp.6m4B2TUln1/eng/eng.Arial.exp0
>>> --max_pages=3 --font=Arial --text=./langdata/eng/eng.training_text
>>
>> [Tue, May 15, 2018 11:42:37 AM] /c/Program Files
>>> (x86)/Tesseract-OCR/text2image --fontconfig_tmpdir=/tmp/font_tmp.CpgpM0lbxD
>>> --fonts_dir=C:WindowsFonts --strip_unrenderable_words --leading=32
>>> --char_spacing=0.0 --exposure=0 
>>> --outputbase=/tmp/tmp.6m4B2TUln1/eng/eng.Corbel.exp0
>>> --max_pages=3 --font=Corbel --text=./langdata/eng/eng.training_text
>>
>> Stripped 2 unrenderable words
>>
>> Rendered page 0 to file C:/Users/asus/AppData/Local/
>>> Temp/tmp.6m4B2TUln1/eng/eng.Arial.exp0.tif
>>
>> Stripped 1 unrenderable words
>>
>> Rendered page 1 to file C:/Users/asus/AppData/Local/
>>> Temp/tmp.6m4B2TUln1/eng/eng.Arial.exp0.tif
>>
>> Stripped 2 unrenderable words
>>
>> Rendered page 0 to file C:/Users/asus/AppData/Local/
>>> Temp/tmp.6m4B2TUln1/eng/eng.Corbel.exp0.tif
>>
>> Stripped 1 unrenderable words
>>
>> Rendered page 1 to file C:/Users/asus/AppData/Local/
>>> Temp/tmp.6m4B2TUln1/eng/eng.Corbel.exp0.tif
>>
>>
>>> === Phase UP: Generating unicharset and unichar properties files ===
>>
>> [Tue, May 15, 2018 11:42:39 AM] /c/Program Files 
>> (x86)/Tesseract-OCR/unicharset_extractor
>>> --output_unicharset /tmp/tmp.6m4B2TUln1/eng/eng.unicharset --norm_mode
>>> 1 /tmp/tmp.6m4B2TUln1/eng/eng.Arial.exp0.box
>>> /tmp/tmp.6m4B2TUln1/eng/eng.Corbel.exp0.box
>>
>> Extracting unicharset from box file C:/Users/asus/AppData/Local/
>>> Temp/tmp.6m4B2TUln1/eng/eng.Arial.exp0.box
>>
>> Extracting unicharset from box file C:/Users/asus/AppData/Local/
>>> Temp/tmp.6m4B2TUln1/eng/eng.Corbel.exp0.box
>>
>> ICU ERROR: U_FILE_ACCESS_ERRORERROR: /tmp/tmp.6m4B2TUln1/eng/eng.unicharset
>>> does not exist or is not readable
>>
>> ## MAKING EVAL DATA ##
>>
>>
>>> === Starting training for language 'eng'
>>
>> [Tue, May 15, 2018 11:42:40 AM] /c/Program Files
>>> (x86)/Tesseract-OCR/text2image --fonts_dir=C:WindowsFonts --font=Calibri
>>> --outputbase=/tmp/font_tmp.n0qq4iJk4q/sample_text.txt
>>> --text=/tmp/font_tmp.n0qq4iJk4q/sample_text.txt
>>> --fontconfig_tmpdir=/tmp/font_tmp.n0qq4iJk4q
>>
>> Rendered page 0 to file C:/Users/asus/AppData/Local/
>>> Temp/font_tmp.n0qq4iJk4q/sample_text.txt.tif
>>
>>
>>> === Phase I: Generating training images ===
>>
>> Rendering using Calibri
>>
>> [Tue, May 15, 2018 11:42:40 AM] /c/Program Files
>>> (x86)/Tesseract-OCR/text2image --fontconfig_tmpdir=/tmp/font_tmp.n0qq4iJk4q
>>> --fonts_dir=C:WindowsFonts --strip_unrenderable_words --leading=32
>>> --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.
>>> h0l64TAxEq/eng/eng.Calibri.exp0 --max_pages=3 --font=Calibri
>>> --text=./langdata/eng/eng.training_text
>>
>> Stripped 2 unrenderable words
>>
>> Rendered page 0 to file C:/Users/asus/AppData/Local/
>>> Temp/tmp.h0l64TAxEq/eng/eng.Calibri.exp0.tif
>>
>> Stripped 1 unrenderable words
>>
>> Rendered page 1 to file C:/Users/asus/AppData/Local/
>>> Temp/tmp.h0l64TAxEq/eng/eng.Calibri.exp0.tif
>>
>>
>>> === Phase UP: Generating unicharset and unichar properties files ===
>>
>> [Tue, May 15, 2018 11:42:42 AM] /c/Program Files 
>> (x86)/Tesseract-OCR/unicharset_extractor
>>> --output_unicharset /tmp/tmp.h0l64TAxEq/eng/eng.unicharset --norm_mode
>>> 1 /tmp/tmp.h0l64TAxEq/eng/eng.Calibri.exp0.box
>>
>> Extracting unicharset from box file C:/Users/asus/AppData/Local/
>>> Temp/tmp.h0l64TAxEq/eng/eng.Calibri.exp0.box
>>
>> ICU ERROR: U_FILE_ACCESS_ERRORERROR: /tmp/tmp.h0l64TAxEq/eng/eng.unicharset
>>> does not exist or is not readable
>>
>>  combine_tessdata to extract lstm model from previous trained set 
>>
>> Extracting

Re: [tesseract-ocr] Re: Problem reading text in two columns

2018-05-11 Thread ShreeDevi Kumar

> I used the tessdata_fast file for English - are these different from
> tessdata-ocr-eng that comes with Ubuntu?

The PPA has traineddata files from tessdata_fast. Ubuntu 18.04 will have
the same.

Older versions of Ubuntu (without the PPA) will have traineddata files for
tesseract 3.0x.

You can try all three (tessdata_fast, tessdata_best and tessdata) to see
which one works best in your case in terms of speed and accuracy.
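
A minimal sketch of that comparison, assuming the three repos are checked out as ./tessdata, ./tessdata_best and ./tessdata_fast (as in the commands quoted below). The helper name `build_commands` is made up for illustration; it only builds the command lines and does not run them.

```python
import shlex


def build_commands(image, tessdata_dirs, lang="eng", psm=6):
    """Return one tesseract argv list per tessdata directory (hypothetical helper)."""
    commands = []
    for tessdata_dir in tessdata_dirs:
        commands.append([
            "tesseract", image, "-",          # "-" sends the text to stdout
            "-l", lang,
            "--psm", str(psm),
            "--tessdata-dir", tessdata_dir,
            "-c", "preserve_interword_spaces=1",
        ])
    return commands


if __name__ == "__main__":
    dirs = ["./tessdata", "./tessdata_best", "./tessdata_fast"]
    for argv in build_commands("receipt.png", dirs):
        # Print a copy-pasteable command; pass argv to subprocess.run to execute it.
        print(shlex.join(argv))
```

Each printed command can be run by hand, or passed to subprocess.run, and the outputs compared side by side.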



ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Fri, May 11, 2018 at 8:29 AM, Brooks Johnson 
wrote:

> I've uninstalled and reinstalled from the PPA and my results resemble
> yours.  I used the tessdata_fast file for English - are these different
> from tessdata-ocr-eng that comes with Ubuntu?
>
> On Wednesday, May 9, 2018 at 3:21:12 AM UTC-5, shree wrote:
>>
>> Please try by building the latest version of tesseract from github
>>>
>>
>> or install from the links given in https://github.com/tesseract-ocr/tesseract/wiki
>>
>> I get the following output using the default eng.traineddata from the
>> three repos - tessdata, tessdata_best, tessdata_fast, without any
>> pre-processing of image.
>>
>> # tesseract receipt.png - --psm 6 --tessdata-dir ./tessdata -c
>> preserve_interword_spaces=1 -c page_separator=''
>>
>> Warning. Invalid resolution 0 dpi. Using 70 instead.
>> CUL DAIRY
>>
>> CHOBANI Y0G  $5.89 F
>> PRODUCE
>>
>> HONEYCRTSP APPLES
>>
>> 0.931b@ $2.29/ Ib $2.13 F
>> Tare Weight: 0.011b
>>
>> BANANAS
>>
>> 3.16 1b®  $0.59/ Ib   $1.86 F
>> Tare Weight: 0.011b
>>
>> BALANCE DUE   $9.88
>>
>>
>> # tesseract receipt.png - --psm 6 --tessdata-dir ./tessdata_best -c
>> preserve_interword_spaces=1 -c page_separator=''
>>
>> Warning. Invalid resolution 0 dpi. Using 70 instead.
>> CUL DAIRY
>>
>> CHOBANI Y0G  $5.89 F
>> PRODUCE
>>
>> HONEYCRISP APPLES
>>
>> 0.931b8  $2.20/ Ib $213 F
>> Tare Weight: 0.011b
>>
>> BANANAS
>>
>> 3.16 1b8 $0.59 Ib   $1.86 F
>> Tare Weight: 0.011b
>>
>> BALANCE DUE   $9.88
>>
>>
>> # tesseract receipt.png - --psm 6 --tessdata-dir ./tessdata_fast  -c
>> preserve_interword_spaces=1 -c page_separator=''
>>
>> Warning. Invalid resolution 0 dpi. Using 70 instead.
>> CUL DAIRY
>>
>> CHOBANI ¥OG  $5.89 F
>> PRODUCE
>>
>> HONEYCRISP APPLES
>>
>> 0.93 Ib @ = $2.29/ Ib $2.13 F
>> Tare Weight: 0.011b
>>
>> BANANAS
>>
>> 3.16 1b @ —$0.59/ Ib   $1.86 F
>> Tare Weight: 0.01Ib
>>
>> BALANCE DUE   $9.88

Re: [tesseract-ocr] Tesseract crashed on blank page

2018-05-10 Thread ShreeDevi Kumar

which version? which o/s? which language?

what command did you use?

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Thu, May 10, 2018 at 10:59 PM, kvc  wrote:

> Hi everyone,
>
> I launch tesseract on a blank page and it crashes and prompts error.
>
> Do you know how to fix it ?
>
> Thanks

Re: [tesseract-ocr] Problem reading text in two columns

2018-05-06 Thread ShreeDevi Kumar

Which version of tesseract are you using?

Which traineddata (from which repo)?

Try with --psm 6 if using tesseract 4 beta. It treats the page as a single
uniform block of text, so each whole line is recognised rather than each column.

On Mon 7 May, 2018, 1:21 AM Brooks Johnson, 
wrote:

>
> 
> I was experimenting with an image of a receipt but there seems to be
> trouble reading the two columns.  I'm including a sample image so you can
> see what I was working with.  The output I get from running "tesseract
> receipt.png out" is this:
>
>
> CUL DAIRY
> CHOBANI VOG
>
> PRODUCE
>
> HONEVURISP APPLES
>
> 0.93 lb 6 $2.29/ 1b
> {are Weyght: 0.011b
>
> BANANAS
>
> 3.16 lb 9 $0,59/ lb
> Tare Weight: 0.01m
>
> BALANCEDlE
>
> $2.13
>
> $1.86
>
> $9.88
>
>
>
> There are a few typos but the biggest concern is that the $5.89 is nowhere
> to be found, but the prices that are below it manage to be included.  That
> first price is still missing after I processed the image and even used a
> different image taken under different lighting.  Am I doing something wrong
> here?

Re: [tesseract-ocr] Tesseract 4.0 extracting multiple columns where one is wanted

2018-05-03 Thread ShreeDevi Kumar

Try with --psm 6
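
For reference, a small sketch of what the two modes mean and how the command changes. The descriptions are the ones tesseract lists for its page segmentation modes; the helper name `tesseract_args` and the file name page.png are only placeholders.

```python
# Descriptions of the page segmentation modes relevant to this thread.
PSM_MODES = {
    3: "Fully automatic page segmentation, but no OSD (default)",
    4: "Assume a single column of text of variable sizes",
    6: "Assume a single uniform block of text",
}


def tesseract_args(image, psm, lang="eng"):
    """Build the argv for a stdout OCR run with the given page segmentation mode."""
    if psm not in PSM_MODES:
        raise ValueError(f"psm {psm} is not described in this sketch")
    return ["tesseract", image, "stdout", "-l", lang, "--psm", str(psm)]


if __name__ == "__main__":
    for mode in (4, 6):
        print(mode, "-", PSM_MODES[mode])
        print("  ", " ".join(tesseract_args("page.png", mode)))
```

With --psm 4 the layout analysis can still treat the section numbers as text of a different size; --psm 6 assumes one uniform block, which is usually closer to what is wanted here.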

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Wed, May 2, 2018 at 9:26 PM, 
wrote:

> I am using Tesseract 4.0 to extract text from scanned PDF documents. I
> first use pdftoppm to split the document into pages represented as png
> files, and then use the following command to perform OCR
>
> tesseract page.pdf stdout -l eng --psm 4
>
> The pages generally have section numbers down the left hand side of the
> page. Sometimes, these are extracted as a column of text, and the actual
> text is extracted as a second column. Since I have set --psm 4, I am
> expecting to get the entire page returned as a single column - and indeed,
> for some pages I do get what I want.
>
> Why is tesseract sometimes extracting the text in columns even when I tell
> it not to, and what can I do about it?

Re: [tesseract-ocr] Trained font - always one letter wrong

2018-05-02 Thread ShreeDevi Kumar

Your image has text in German. You will get better results using language
`deu` out of the box.

Attached are OCR results using deu.traineddata from tessdata_best and
tessdata_fast using tesseract-4.0.0-beta.1 run via command line.

# tesseract sample.tif sample-deu-fast -l deu --tessdata-dir ./tessdata_fast --psm 6 -c preserve_interword_spaces=1
Tesseract Open Source OCR Engine v4.0.0-beta.1-207-g984a with Leptonica
Page 1

# tesseract sample.tif sample-deu-best -l deu --tessdata-dir ./tessdata_best --psm 6 -c preserve_interword_spaces=1
Tesseract Open Source OCR Engine v4.0.0-beta.1-207-g984a with Leptonica
Page 1



ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Wed, May 2, 2018 at 10:20 PM,  wrote:

> I attached a sample TIF
>
> hope this will work.
>
>
> Am Mittwoch, 2. Mai 2018 08:43:15 UTC+2 schrieb shree:
>>
>> Please provide a small sample image to test.
>>
>> ShreeDevi
>> 
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Wed, May 2, 2018 at 11:26 AM,  wrote:
>>
>>> Training doesn't work. If I use the characters "ä, ö, ü" (which I need)
>>> in my training text, text2image says: WARNING:
>>> illegal UTF8 encountered and then creates an incorrect box/tif pair.
>>> This does not seem to depend on my font, because with Arial it does the same
>>> thing.
>>> Can you help me to avoid this?

29.10.2017-07:49 +49 3571   LST_IRLS
 s. 111
Einsatzdepesche:  Ausdruck am: 29.10.2017? um: 07:49
Einsatzdaten:

Gemeinde:

Ortsteil:

Straße : Haus-Nr. :

Stichwort   : H1 THL klein Auswahl:

Sondersignal: Ja

Label   : Unwetter   28.10.17

Ob jekt   !

Einsatzplan :

Melder   !

Hydrantenbuch:

Was  : URU olme Personenschaden

Hinweise:

Feuerwehrplan:

Gebäudefunk:Notschlüsselrohr: PU Anlage:
Fahrzeuge - alarmiert: (Wache/Funkkemer/Typ/Fahrtnunner )
Fahrzeuge - bereits im Einsatz: (Wache/Funkkemner/Typ/Fahrtnumner )
29.10,.2017-07;49 +49 3571   LST_IRLS   
  S. 1/1
Einsatzdepesche:  Ausdruck am: 29.10.2017 um: 07:49
Einsatzdaten:

Gemeinde :

Ortsteil:

Straße : Haus-Nr. :

Stichwort   : H1 THL klein Auswahl:

Sondersignal: Ja

Label   : Unwetter   28.10.17

Objekt   :

Einsatzplan :

Melder   :

Hydrantenbuch:

Was  : UKU olme Personenschaden

Hinweise:

Feuerwehrplan:

Gebäudefunk:Notschlüsselrohr: PU Anlage:
Fahrzeuge - alarmiert: (Wache-/Funkkemer-/Typ-Fahrtnummer)
Fahrzeuge - bereits im Einsatz: (Wache-/Funkkemer-/Typ-Fahrtnummer)

Re: [tesseract-ocr] Trained font - always one letter wrong

2018-05-02 Thread ShreeDevi Kumar

Please provide a small sample image to test.
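
In the meantime, one thing worth ruling out for the "illegal UTF8" warning is that the training text was saved in an encoding other than UTF-8 (for example Latin-1 or Windows-1252). That is only a guess at the cause, not something confirmed in this thread. A minimal check, with `find_invalid_utf8` being a made-up helper name:

```python
def find_invalid_utf8(data: bytes):
    """Return (offset, byte value) of the first byte that breaks UTF-8 decoding, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start, data[err.start]
    return None


if __name__ == "__main__":
    good = "ä ö ü Straße".encode("utf-8")
    bad = "Straße".encode("latin-1")   # simulates a file saved in the wrong encoding
    print(find_invalid_utf8(good))
    print(find_invalid_utf8(bad))
```

For a real file, read it in binary mode (open(path, "rb").read()) and pass the bytes in. If an offset is reported, re-saving the training text as UTF-8 is the first thing to try.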

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Wed, May 2, 2018 at 11:26 AM,  wrote:

> Training doesn't work. If I use the characters "ä, ö, ü" (which I need) in
> my training text, text2image says: WARNING:
> illegal UTF8 encountered and then creates an incorrect box/tif pair.
> This does not seem to depend on my font, because with Arial it does the same
> thing.
> Can you help me to avoid this?

Re: [tesseract-ocr] Do I need to call Init before every rectangle?

2018-05-01 Thread ShreeDevi Kumar

See
https://github.com/tesseract-ocr/tesseract/wiki/FAQ#there-are-inconsistent-results-from-tesseract-when-the-same-tessbaseapi-object-is-used-for-decoding-multiple-images
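
As far as I understand that FAQ entry, the differences come from the adaptive classifier carrying data over from one image to the next, and clearing it between images avoids a full End()/Init() cycle. A rough sketch of the loop, written in Python against any object that exposes the usual SetImageFile / GetUTF8Text / ClearAdaptiveClassifier methods (the tesserocr bindings appear to provide these, but treat that as an assumption and check your wrapper):

```python
def ocr_images(api, paths):
    """OCR every path with one shared API object, forgetting adaptive data in between."""
    results = []
    for path in paths:
        api.SetImageFile(path)
        results.append(api.GetUTF8Text())
        # Drop whatever the classifier adapted to on this image so that
        # the next image starts from the same state as a fresh Init().
        api.ClearAdaptiveClassifier()
    return results
```

In C++ the same idea would be a call to ClearAdaptiveClassifier() on the TessBaseAPI object after each image, instead of End() followed by Init().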



On Tue 1 May, 2018, 12:53 PM Ben Rogall,  wrote:

>
> I am using the baseapi to OCR a large number of small text images, most of
> which just have a few digits. If I call End() and Init() after every image,
> the results are basically perfect. If I just delete the char string and go
> on to the next image the results are much worse, with extra characters
> thrown in at the beginning of the text. Is it necessary to call Init()
> every time? It greatly slows the process. I have tried calling Clear()
> between images, but that had no effect.
> Thanks for any suggestions.
> Ben

Re: [tesseract-ocr] Trained font - always one letter wrong

2018-04-30 Thread ShreeDevi Kumar

Use the latest version, 4.0.0-beta.


On Sun 29 Apr, 2018, 1:51 PM ,  wrote:

> I did. Unfortunately they don't answer...
> Do you have any advice for me to improve the
> training process? How many training texts should I use? Or is it possible
> that there is a problem with this font altogether? It would help very much to find
> that out.
>
> Best regards Dave

Re: [tesseract-ocr] tesseract performs wrong auto-correction sometimes : how to disable it?

2018-04-29 Thread ShreeDevi Kumar

Please provide a sample image to test.
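
When reporting cases like the ones quoted below, it helps to state exactly which positions differ between the expected text and the OCR output. A small sketch (the function name `diff_positions` is only illustrative):

```python
def diff_positions(expected, actual):
    """Return (index, expected_char, actual_char) for every mismatching position."""
    if len(expected) != len(actual):
        raise ValueError("strings differ in length; compare them manually")
    return [
        (i, e, a)
        for i, (e, a) in enumerate(zip(expected, actual))
        if e != a
    ]


if __name__ == "__main__":
    pairs = [
        ("12345678I", "123456781"),
        ("GLOTHUVFI", "GLOTHUVFI"),
        ("12345678A", "123456784"),
    ]
    for expected, actual in pairs:
        print(expected, actual, diff_positions(expected, actual))
```

For the three pairs above, only the trailing letter after a run of digits changes, which is the pattern described in the original report.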

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Thu, Apr 26, 2018 at 1:35 PM, Youcef  wrote:

>
> I'm using master branch with tessdata_fast models
>
> Le mercredi 25 avril 2018 18:49:22 UTC+2, shree a écrit :
>
>> Which version of tesseract are you using?
>>
>> ShreeDevi
>> 
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Wed, Apr 25, 2018 at 8:29 PM, Youcef  wrote:
>>
>>> Hi,
>>>
>>>
>>> Tesseract seems to post-process its predictions.
>>>
>>> Here is what I get after running OCR on images (same font, same size;
>>> images generated with text2image):
>>>
>>> - an image containing "12345678I" => `123456781`
>>> - an image containing "GLOTHUVFI" => `GLOTHUVFI`
>>> - an image containing "12345678H" => `12345678H`
>>> - an image containing "GLOTHUVFH" => `GLOTHUVFH`
>>> - an image containing "12345678A" => `123456784`
>>> - an image containing "GLOTHUVFA" => `GLOTHUVFA`
>>>
>>> It looks like Tesseract doesn't like a word made of several digits with one
>>> letter at the end. In fact, if the letter looks like a digit ("I" and "A"
>>> look like "1" and "4" respectively), it replaces it with the closest digit.
>>> I have tried tuning the following parameters without any change in the
>>> result:
>>>
>>> - segment_penalty_dict_frequent_word
>>> - language_model_penalty_chartype
>>>
>>> Thanks for any help.
>>>
>>> Regards

Re: [tesseract-ocr] Tesseract config for simple single words text and questions about learning

2018-04-29 Thread ShreeDevi Kumar

Try tesseract-4.0.0-beta

I get correct results with it from command line


# tesseract numbers-test.png numbers-test --tessdata-dir ./tessdata_fast -l eng --oem 1 --psm 6
Tesseract Open Source OCR Engine v4.0.0-beta.1-200-g37d20 with Leptonica
Warning. Invalid resolution 0 dpi. Using 70 instead.

# tesseract numbers-test2.png numbers-test2 --tessdata-dir ./tessdata_fast -l eng --oem 1 --psm 6
Tesseract Open Source OCR Engine v4.0.0-beta.1-200-g37d20 with Leptonica
Warning. Invalid resolution 0 dpi. Using 70 instead.

# tesseract letters-test.png letters-test --tessdata-dir ./tessdata_fast -l eng --oem 1 --psm 6
Tesseract Open Source OCR Engine v4.0.0-beta.1-200-g37d20 with Leptonica
Warning. Invalid resolution 0 dpi. Using 70 instead.
#
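
Since the fragments described in the quoted message have a fixed shape (seven digits, or two uppercase letters), the OCR result can also be checked after the fact, whichever engine version produced it. A minimal sketch; the names `is_valid`, "digits" and "letters" are invented for this example.

```python
import re

# Expected shapes taken from the description: 7 digits, or 2 uppercase letters.
PATTERNS = {
    "digits": re.compile(r"[0-9]{7}"),
    "letters": re.compile(r"[A-Z]{2}"),
}


def is_valid(text, kind):
    """Return True if the OCR text, stripped of whitespace, matches the expected shape."""
    if kind not in PATTERNS:
        raise ValueError(f"unknown fragment kind: {kind!r}")
    return PATTERNS[kind].fullmatch(text.strip()) is not None


if __name__ == "__main__":
    print(is_valid("5748788\n\n", "digits"))
    print(is_valid("AT\n", "letters"))
    print(is_valid("574878B", "digits"))
```

Results that fail the check can be retried with different settings or flagged for manual review, which gives a consistency signal that does not depend on the confidence value.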




ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sat, Apr 28, 2018 at 5:03 PM, Lorenzo Blz  wrote:

>
> Hi, I'm using tesseract to recognize small fragments of text like this
> (actual images I'm using):
>
>
>
>
>
> Numbers are fixed length (7 digits) and letters are always 2 uppercase
> characters. I'm using a whitelist (a different one depending on whether the fragment
> is text or digits, I know this in advance), and it works reasonably well.
> The size of these fragments is fixed, I rescale them to the same height (54
> pixels, I could change it or add some borders). These are extracted from
> smartphone pictures so the original resolution varies a lot.
>
> I'm using lang "eng+ita" because in this way I get better results. I'm
> also using user-patterns but they are not helping much. I'm using the api
> through tesserocr  python bindings.
>
> I think there are many parameters I can fine-tune, but I tried a few
> (load_system_dawg, load_freq_dawg, textord_min_linesize) and none of these
> improved the results (a very small textord_min_linesize=0.2 made them
> worse, so they are being used). I've read the FAQ and the docs but there
> are really too many parameters to understand what to change and how.
>
> In particular my current problem is adaptive learning: when I process a
> large batch of pictures the result varies depending on other fragments.
> Fragments that are perfectly readable and correctly classified when
> processed individually, give different, wrong, results when processed in a
> batch (I mean reusing the same api instance for multiple images).
>
> I tried to disable it but it looks like it cannot be
> disabled when using multiple languages(?).
>
> If I use only "ita" (and no whitelist, no learning) the first image in
> this post is recognized as (text [confidence]):
>
> ('5748788\n\n', [81])
> ('5748788\n\n', [81])
> ('5748788\n\n', [81])
> ('5748788\n\n', [81])
>
> With learning (multiple calls, no whitelist, lang: ita):
>
> ('5748788\n\n', [81])
> ('5748788\n\n', [81])
> ('5748788\n\n', [90])
> ('5748788\n\n', [90])
>
> so it improves to a higher confidence (I do not know how much the
> confidence value matters in real life). It looks like learning is doing
> something good even with no whitelist (I could use the whitelist anyway,
> just to be sure, but the starting point looks better).
>
> I'm wondering if I can do some kind of "warmup" with learning enabled and
> later turn it off (I'll try this today). But how many samples do I need?
> And it seems a little hacky.
>
> Or maybe there is some way to print debug information from the learning
> part to see what parameters are changed and set them manually later (I
> tried a few debug params but got no output).
>
> Or maybe it is quite easy to manually find good parameters for this kind
> of regular text to get close to 90 confidence.
>
> On the "AT" fragment I get 89 confidence and I think it may be quite low
> for this kind of simple clean text.
>
> What I need are (good) consistent results in all situations for the same
> image. What do you think?
>
>
> Thanks, bye
>
> Lorenzo

Re: [tesseract-ocr] Trained font - always one letter wrong

2018-04-29 Thread ShreeDevi Kumar

Check that your training text has enough samples for d.
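
As a rough way to check this, the sketch below counts how often each character occurs in a training text and lists the ones that fall under a threshold. It is only an illustration: the function name, the `required` argument and the threshold of 20 are assumptions made here, not tesseract settings.

```python
from collections import Counter


def rare_characters(training_text, required="", min_count=20):
    """Return {char: count} for characters seen fewer than min_count times.

    `required` lists characters that must be covered even if they never
    occur in the text (they are then reported with a count of 0).
    The default threshold of 20 is an arbitrary assumption.
    """
    counts = Counter(ch for ch in training_text if not ch.isspace())
    chars = set(counts) | set(required)
    return {ch: counts[ch] for ch in sorted(chars) if counts[ch] < min_count}


if __name__ == "__main__":
    sample = "dad add"
    # 'd' occurs 4 times, 'a' twice, 'q' never
    print(rare_characters(sample, required="dq", min_count=3))
```

If the problem letter shows up in the result for your own training text, adding more lines that contain it is the first thing to try.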

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sun, Apr 29, 2018 at 1:51 PM,  wrote:

> I did. Unfortunately they don't answer...
> Do you have any advice for me on how to improve the
> training process? How many training texts should I use? Or is it possible
> that there is a problem with this font itself? It would help very much to find
> that out.
>
> Best regards Dave
>

Re: [tesseract-ocr] tesseract 4 beta: openCL useage

2018-04-28 Thread ShreeDevi Kumar

@zdenko This discussion may be better suited to the tesseract-dev forum, or do
you want to track it as an issue on GitHub?

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sat, Apr 28, 2018 at 1:19 PM, Janpieter Sollie <janpietersol...@gmail.com>
wrote:

> Would it be a problem for you if I rewrite the opencl engine completely,
> and you people provide me help to link the tesseract kernel -> opencl
> engine parts?
> in attachment, I already have a list of features I'd like to port to
> openCL.  As this uses the GPU in a heavy way, I will implement multi-card
> support on the host.
> Is it a problem for you guys to think of tesseract 5.0 as a milestone?
>
>
> 2018-04-27 15:53 GMT+00:00 Janpieter Sollie <janpietersol...@gmail.com>:
>
>> if I'm right, a neural net is about the engine parts, not the image
>> characterisation rendering method, am I right? because I see many
>> presentations, and most of them talk about the history of tesseract, but
>> that's not what I need
>>
>> 2018-04-27 14:27 GMT+00:00 ShreeDevi Kumar <shreesh...@gmail.com>:
>>
>>> Please see
>>>
>>> https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LSTM
>>>
>>> For info about neural nets used by tesseract
>>>
>>> On Fri 27 Apr, 2018, 7:48 PM Janpieter Sollie, <
>>> janpietersol...@gmail.com> wrote:
>>>
>>>> I had a quick thought about what you could offload to opencl.  I will
>>>> need some help from you people (I am a C programmer, not C++, at least not
>>>> experienced) to do the host code, but this algorithm is perfectly
>>>> optimizable in openCL.
>>>> the way I'd do it:
>>>>
>>>> prerequisites:
>>>> - you can define 65k offsets (x,y) in which you want the openCL engine
>>>> to look for dots (x,y), the optimal position and closest neighbour can be
>>>> reported in the first part.
>>>> - you can make a RAW image of both the image and the characters. size
>>>> of the letters doesn't matter, but they must be trimmed properly
>>>>
>>>> 1. you give me a matrix of 256*256 offsets(short, short) to analyze,
>>>> with a max of 64 dots (char, char) (I assume these are neurons) to analyze
>>>> in each offset.
>>>> so, this gives you a start memory usage of   2⁸ * 2⁸ *4 + 64*2 = 256k +
>>>> 128 bytes
>>>> each dot MUST contain a black pixel.
>>>> then we add the image, this is a charimage of max (to be discussed with
>>>> you guys), I assume a 4096*4096 pixel image would be fine, especially when
>>>> a character can contain a 4x4 matrix defining a 0/1 (black/white) value.
>>>> 2. Then I follow these steps in the openCL engine:
>>>> - we analyze the neurons
>>>> - draw a circle around them of x black points. (this circle can be
>>>> 0, in which case the  neuron is white), for which the circle is completely
>>>> black
>>>> - when we encounter one or more white points, a direction of the
>>>> points is calculated. if there's no whitespace at the other side, the
>>>> neuron offset is moved for x/2 in the opposite direction and analyze neuron
>>>> is restarted for x/2.  else, quit the 'analyze neuron' part.  This can be
>>>> done in local memory, in which case it will cost you 256*2=512 bytes of
>>>> local ram to determine the optimal neuron position. Most graphic cards have
>>>> a limit of 32k ram, so this is no problem :-)
>>>> - determine the closest dot next to this one:
>>>> for each dot != this one, draw a line of black points, if no line
>>>> can be found, jump to next dot.
>>>> watch distance.  If it's smaller than the previous neuron && this
>>>> dot id hasn't a link pointing from the destination to this one, save dot 
>>>> id.
>>>> so, at the end:
>>>> - each neuron of each offset is optimally centered in a return
>>>> matrix of 256*256*64*2 = 2²³ = 8M of memory
>>>> - each neuron has a unique id to its closest neighbour, to which
>>>> it's guaranteed to be attached. an id of -1 means no id could be found.
>>>> 256*256*64 = 4M of memory
>>>>
>>>> 3. we focus on neuron list -> character mapping. this is a separate
>>>> kernel. A "probability" factor is involved here, but I will think about it
>>>> further.  I suggest to use a list of 64 ch

Re: [tesseract-ocr] tesseract 4 beta: openCL useage

2018-04-27 Thread ShreeDevi Kumar

Please see

https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LSTM

For info about neural nets used by tesseract
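
The memory figures in the proposal quoted below can be re-derived with a few lines of Python. This is only a sanity check of the arithmetic; the variable names and the per-item byte sizes (2-byte shorts, 1-byte chars, 1-byte neighbour ids) are assumptions read off the quoted text.

```python
KIB = 1024
MIB = 1024 * KIB

# 256*256 offsets, each an (x, y) pair of 2-byte shorts -> 4 bytes per offset
offsets_bytes = 256 * 256 * 4
# up to 64 dots, each a pair of 1-byte chars -> 2 bytes per dot
dots_bytes = 64 * 2
# return matrix: one centered (char, char) position per neuron per offset
centered_bytes = 256 * 256 * 64 * 2
# one 1-byte neighbour id per neuron per offset
neighbour_bytes = 256 * 256 * 64

print(offsets_bytes // KIB, "KiB +", dots_bytes, "bytes")
print(centered_bytes // MIB, "MiB and", neighbour_bytes // MIB, "MiB")
```

So the "256k + 128 bytes", "8M" and "4M" figures are consistent with the sizes stated in the proposal.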

On Fri 27 Apr, 2018, 7:48 PM Janpieter Sollie, 
wrote:

> I had a quick thought about what you could offload to opencl.  I will need
> some help from you people (I am a C programmer, not C++, at least not
> experienced) to do the host code, but this algorithm is perfectly
> optimizable in openCL.
> the way I'd do it:
>
> prerequisites:
> - you can define 65k offsets (x,y) in which you want the openCL engine to
> look for dots (x,y), the optimal position and closest neighbour can be
> reported in the first part.
> - you can make a RAW image of both the image and the characters. size of
> the letters doesn't matter, but they must be trimmed properly
>
> 1. you give me a matrix of 256*256 offsets(short, short) to analyze, with
> a max of 64 dots (char, char) (I assume these are neurons) to analyze in
> each offset.
> so, this gives you a start memory usage of   2⁸ * 2⁸ *4 + 64*2 = 256k +
> 128 bytes
> each dot MUST contain a black pixel.
> then we add the image, this is a charimage of max (to be discussed with
> you guys), I assume a 4096*4096 pixel image would be fine, especially when
> a character can contain a 4x4 matrix defining a 0/1 (black/white) value.
> 2. Then I follow these steps in the openCL engine:
> - we analyze the neurons
> - draw a circle around them of x black points. (this circle can be 0,
> in which case the  neuron is white), for which the circle is completely
> black
> - when we encounter one or more white points, a direction of the
> points is calculated. if there's no whitespace at the other side, the
> neuron offset is moved for x/2 in the opposite direction and analyze neuron
> is restarted for x/2.  else, quit the 'analyze neuron' part.  This can be
> done in local memory, in which case it will cost you 256*2=512 bytes of
> local ram to determine the optimal neuron position. Most graphic cards have
> a limit of 32k ram, so this is no problem :-)
> - determine the closest dot next to this one:
> for each dot != this one, draw a line of black points, if no line can
> be found, jump to next dot.
> watch distance.  If it's smaller than the previous neuron && this dot
> id hasn't a link pointing from the destination to this one, save dot id.
> so, at the end:
> - each neuron of each offset is optimally centered in a return matrix
> of 256*256*64*2 = 2²³ = 8M of memory
> - each neuron has a unique id to its closest neighbour, to which it's
> guaranteed to be attached. an id of -1 means no id could be found.
> 256*256*64 = 4M of memory
>
> 3. we focus on neuron list -> character mapping. this is a separate
> kernel. A "probability" factor is involved here, but I will think about it
> further.  I suggest to use a list of 64 character images at once, otherwise
> you need lots of memory :-)
> - define the top, left and right neuron. create a zoom factor for the
> image. calculate the aspect ratio.  The probability is
> 1-diff(aspect_ratio1, aspect_ratio2)
> - analyze each link in the font character. total probability *=
> (found_link_length / total_link_length)
> - report the probability.
> On the PC: the character with the highest probability is the character
> you're looking for.  Be aware that you need to compare the possibilities of
> the different offsets if they overlap.
>
> if the tesseract project can use this, please let me know
>
> 2018-04-27 9:36 GMT+00:00 Zdenko Podobny :
>
>> Only documentation we have is code itself ;-) But you can start with
>> searching for opencl issue in tesseract issue tracker on github...
>>
>> Zdenko
>>
>>
>> On Fri, 27 Apr 2018 at 10:56, Janpieter Sollie wrote:
>>
>>> I'd be glad to help.  Using tesseract 4, I am able to reach 90%
>>> accuracy with OpenCL.  I do not have any experience with neural networks (I'm
>>> just a high-school-educated (no college) IT-support guy with some knowledge
>>> of OpenCL), so can you recommend some documentation to help me understand the
>>> engine of tesseract 4?
>>>
>>> 2018-04-27 10:50 GMT+02:00 Zdenko Podobny :
>>>
>>>> If you have experience, your help will be warmly welcomed.
>>>> OpenCL is not maintained and it is well on its way to being removed if
>>>> no maintainer/contributor is found.
>>>> Anyway, it is not used extensively, so there is room for improvement.
>>>>
>>>> Zdenko
>>>>
>>>>
>>>> On Fri, 27 Apr 2018 at 10:21, Janpieter Sollie wrote:
>>>>
>>>>> Hello everyone,
>>>>>
>>>>> I have a question about the openCL selection procedure of tesseract:
>>>>>
>>>>> my output:
>>>>>
>>>>> [DS] Profile read from file (tesseract_opencl_profile_devices.dat).
>>>>> [DS] Device[1] 1:Fiji score is 0.202927
>>>>> [DS] Device[2] 1:Ellesmere score is 1.468799
>>>>> [DS] Device[3] 1:Ellesmere score is 1.468799
>>>>> [DS] Device[4] 1:Bonaire score is 1.533776

Re: [tesseract-ocr] Problem facing with tessearct training 4 with arabic

2018-04-25 Thread ShreeDevi Kumar

You are trying to train only digits, but you are then using the unicharset,
which contains only those digits, to compress the wordlist (which uses the
Arabic alphabet) into a 'dawg'.

The command you have used only creates the starter traineddata for LSTM
training. Please follow the instructions given in the wiki page related to
training tesseract4.

https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
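
To see the mismatch for yourself, a small sketch like the one below lists the wordlist entries that contain characters missing from the character set. It is an illustration only: the character set is passed as a plain string here, which is a simplification and not the real unicharset file format, and the function name is made up for this example.

```python
def words_outside_charset(words, charset):
    """Return the words that use at least one character not in charset."""
    allowed = set(charset)
    return [word for word in words if not set(word) <= allowed]


if __name__ == "__main__":
    digits_only = "0123456789"
    wordlist = ["2018", "\u0643\u062a\u0627\u0628", "42"]  # middle entry is an Arabic word
    # Only the Arabic word is reported, since none of its letters are digits.
    print(words_outside_charset(wordlist, digits_only))
```

Every word reported this way cannot be encoded with a digits-only unicharset, which is why the wordlist and the unicharset have to match.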

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Thu, Apr 26, 2018 at 4:59 AM, Amir Raouf  wrote:

> First, Arabic is read by tesseract with good accuracy, but NO DIGITS are
> read, so I decided to train only numbers with the specific font I need.
>
> This is the question:
> https://stackoverflow.com/questions/50029477/issue-with-training-tesseract-4-0
>
> Any advice?
>

Re: [tesseract-ocr] tesseract performs wrong auto-correction sometimes : how to disable it?

2018-04-25 Thread ShreeDevi Kumar

Which version of tesseract are you using?

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Wed, Apr 25, 2018 at 8:29 PM, Youcef  wrote:

> Hi,
>
>
> Tesseract seems to post-process its prediction.
>
> Here is what I get after running OCR on images (same font, same size images
> generated with text2image):
>
> - an image containing "12345678I" => `123456781`
> - an image containing "GLOTHUVFI" => `GLOTHUVFI`
> - an image containing "12345678H" => `12345678H`
> - an image containing "GLOTHUVFH" => `GLOTHUVFH`
> - an image containing "12345678A" => `123456784`
> - an image containing "GLOTHUVFA" => `GLOTHUVFA`
>
> It looks like Tesseract doesn't like a word made of some numbers and one
> letter at the end. In fact, if the letter looks like a number ("I" and "A"
> look like "1" and "4" respectively), it replaces it with the closest number.
> I have tried to tune the following parameters without any change in the
> result:
>
> - segment_penalty_dict_frequent_word
> - language_model_penalty_chartype
>
> Thanks for any help.
>
> Regards
>

Re: [tesseract-ocr] Re: Box file generator combines vertical lines across rows of text

2018-04-24 Thread ShreeDevi Kumar

Please provide a sample tiff, single page will do, for testing.



On 25-Apr-2018 2:00 AM, "Cameron McSweeney"  wrote:

Yes, and the box files 4.0 made still had the same problem. The accuracy
with 4.0 was much better but it still needs some tweaking, so I figured I
would be better off fixing the problem in 3.05



Re: [tesseract-ocr] Re: Box file generator combines vertical lines across rows of text

2018-04-24 Thread ShreeDevi Kumar

Have you tried the latest version, tesseract 4.0.0beta?

On Wed 25 Apr, 2018, 12:03 AM Cameron McSweeney, 
wrote:

> Tesseract seems to be much too willing to find vertical lines. For
> example, Ds will be divided so that the straight, left portion is separate
> from the right, curved portion. The font is fixed, so stuff like that
> shouldn't happen
>

Re: [tesseract-ocr] Install Tesseract 4 on CentOS and Red Hat [SOLVED!]

2018-04-24 Thread ShreeDevi Kumar

I have never used equ.traineddata. From feedback in the forum I don't think
it works very well. Maybe equ has not been trained via LSTM training, I
have no way of knowing. Only Ray Smith or other developers from Google can
answer that.

Only LSTM models exist in tessdata_best and tessdata_fast.

Depending on the language and the hardware that you are running on,
tesseract 4 can be slower than tesseract 3 - see various issues related to
performance on GitHub. However accuracy has improved a lot and a larger
number of languages are available for tesseract 4.
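
As a small sketch of what this means on the command line, the helper below builds a tesseract call that points at a tessdata_fast or tessdata_best directory and selects the LSTM engine with `--oem 1`. The helper name and the example paths are assumptions for illustration; the flags themselves (`--tessdata-dir`, `--oem`, `-l`) are the regular tesseract 4 options.

```python
def tesseract_command(image, tessdata_dir, lang="eng", output="stdout"):
    """Build an argument list for an LSTM-only tesseract 4 run."""
    return [
        "tesseract", image, output,
        "--tessdata-dir", tessdata_dir,
        "--oem", "1",  # LSTM only; the fast/best models have no legacy engine data
        "-l", lang,
    ]


if __name__ == "__main__":
    cmd = tesseract_command("names.png", "./tessdata_fast")
    print(" ".join(cmd))
    # To actually run it (needs tesseract installed):
    #   import subprocess; subprocess.run(cmd, check=True)
```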

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, Apr 24, 2018 at 9:07 PM, Eugene Huang  wrote:

> @Shree
> Thanks for the tip. Just 2 quick questions.
> 1) From https://github.com/tesseract-ocr/tesseract/wiki/Data-Files, it
> says that "osd" and "equ" traineddata files are compatible between
> Tesseract 3 and 4. In the GitHub tessdata_fast repo (
> https://github.com/tesseract-ocr/tessdata_fast), "osd" is there with the
> commit "Use legacy Orientation Script Detector (OSD) because that is the
> only thing that currently works." However, "equ" is not in the repo. Was
> this simply a small mistake where the maintainer forgot to include the
> "equ" data file?
>
> 2) Also, with tessdata_fast, I was able to get Tesseract 4 running faster
> than using Tesseract 4 with tessdata. However, is Tesseract 4 supposed to
> be slower than Tesseract 3 because that's what I'm experiencing?
>
>
>
>
> # Here are the updated instructions to download tessdata_fast, which I
> # tested to indeed perform faster than tessdata.
> # However, when calling Tesseract from the command line, using the
> # arguments "--oem 2" will no longer work.
> # Use "--oem 1" since only the neural net LSTM model exists if using
> # tessdata_fast.
> wget -O osd.traineddata "https://github.com/tesseract-ocr/tessdata_fast/blob/master/osd.traineddata?raw=true"
> wget -O eng.traineddata "https://github.com/tesseract-ocr/tessdata_fast/blob/master/eng.traineddata?raw=true"
> wget -O chi_sim.traineddata "https://github.com/tesseract-ocr/tessdata_fast/blob/master/chi_sim.traineddata?raw=true"
>
>
> On Monday, April 23, 2018 at 2:37:09 PM UTC-4, shree wrote:
>>
>> Thanks for the script to install tesseract on CentOS.
>>
>> I would suggest using traineddata files from tessdata_fast or
>> tessdata_best repos for better accuracy and speed.
>>
>> On Mon 23 Apr, 2018, 11:52 PM Eugene Huang,  wrote:
>>
>>> Hello! Most people are probably running Tesseract 4 on Ubuntu, MacOS,
>>> and Windows. Unfortunately, there are no clear instructions on installing
>>> Tesseract 4 for other flavors of Linux--probably most notably CentOS and
>>> Red Hat.
>>>
>>> After going through dependency hell, I successfully installed Tesseract
>>> 4 onto CentOS 7. I presume that the installation script should also work
>>> for Red Hat. I want to give credit to EisenVault because this script is
>>> essentially a modified version of his script. This is my first contribution
>>> to open source software, so any tips will be highly appreciated!
>>>
>>> When running this script line by line, you probably have to prefix
>>> "sudo" to each line, or you can copy and paste into a bash script and then
>>> run sudo along with the script. I have tested both to work on a fresh image
>>> of CentOS 7 on VirtualBox.
>>>
>>> Cheers!
>>>
>>> # (Estimated Time of Completion: 45 minutes)
>>> # Instructions taken (and slightly modified) from
>>> # https://github.com/EisenVault/install-tesseract-redhat-centos/blob/master/install-tesseract.sh
>>> cd /opt
>>> # The following line will take 30 minutes to install.
>>> yum -y update
>>> yum -y install libstdc++ autoconf automake libtool autoconf-archive pkg-config \
>>>     gcc gcc-c++ make libjpeg-devel libpng-devel libtiff-devel zlib-devel
>>> yum group install -y "Development Tools"
>>>
>>>
>>> # Install Leptonica from Source
>>> wget http://www.leptonica.com/source/leptonica-1.75.3.tar.gz
>>> tar -zxvf leptonica-1.75.3.tar.gz
>>> cd leptonica-1.75.3
>>> ./autobuild
>>> ./configure
>>> make -j
>>> make install
>>> cd ..
>>> # Delete tar.gz file if you like
>>>
>>>
>>> # Sanity checks
>>> # check if libpng is installed: type "whereis libpng" and expect to see
>>> # a directory; a blank line is not good
>>> # check if leptonica is installed: type "ls /usr/local/include" and
>>> # expect to see "leptonica"
>>>
>>>
>>> # Install Tesseract from Source
>>> wget https://github.com/tesseract-ocr/tesseract/archive/4.0.0-beta.1.tar.gz
>>> tar -zxvf 4.0.0-beta.1.tar.gz
>>> cd tesseract-4.0.0-beta.1/
>>> ./autogen.sh
>>> PKG_CONFIG_PATH=/usr/local/lib/pkgconfig LIBLEPT_HEADERSDIR=/usr/local/include \
>>>     ./configure --with-extra-includes=/usr/local/include \
>>>     --with-extra-libraries=/usr/local/lib
>>> LDFLAGS="-L/usr/local/lib" CFLAGS="-I/usr/local/include" make -j
>>> make install
>>> ldconfig
>>> cd ..
>>> # Delete tar.gz file if you like
>>>
>>>
>>> #

Re: [tesseract-ocr] Install Tesseract 4 on CentOS and Red Hat [SOLVED!]

2018-04-23 Thread ShreeDevi Kumar

Thanks for the script to install tesseract on CentOS.

I would suggest using traineddata files from tessdata_fast or tessdata_best
repos for better accuracy and speed.
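
A minimal sketch of how the download links for those repos can be put together is shown below. The helper name is made up for this example, and it assumes that tessdata_fast and tessdata_best use the same `raw/master/<lang>.traineddata` layout as the tessdata links in the script quoted below.

```python
KNOWN_REPOS = ("tessdata", "tessdata_fast", "tessdata_best")


def traineddata_url(lang, repo="tessdata_fast"):
    """Return the assumed raw download URL for <lang>.traineddata."""
    if repo not in KNOWN_REPOS:
        raise ValueError(f"unknown repo: {repo}")
    return f"https://github.com/tesseract-ocr/{repo}/raw/master/{lang}.traineddata"


if __name__ == "__main__":
    for lang in ("osd", "eng", "chi_sim"):
        print(traineddata_url(lang))
```

The printed URLs can be passed to wget in place of the tessdata ones in the script.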

On Mon 23 Apr, 2018, 11:52 PM Eugene Huang,  wrote:

> Hello! Most people are probably running Tesseract 4 on Ubuntu, MacOS, and
> Windows. Unfortunately, there are no clear instructions on installing
> Tesseract 4 for other flavors of Linux--probably most notably CentOS and
> Red Hat.
>
> After going through dependency hell, I successfully installed Tesseract 4
> onto CentOS 7. I presume that the installation script should also work for
> Red Hat. I want to give credit to EisenVault because this script is
> essentially a modified version of his script. This is my first contribution
> to open source software, so any tips will be highly appreciated!
>
> When running this script line by line, you probably have to prefix "sudo"
> to each line, or you can copy and paste into a bash script and then run
> sudo along with the script. I have tested both to work on a fresh image of
> CentOS 7 on VirtualBox.
>
> Cheers!
>
> # (Estimated Time of Completion: 45 minutes)
> # Instructions taken (and slightly modified) from
> # https://github.com/EisenVault/install-tesseract-redhat-centos/blob/master/install-tesseract.sh
> cd /opt
> # The following line will take 30 minutes to install.
> yum -y update
> yum -y install libstdc++ autoconf automake libtool autoconf-archive pkg-config \
>     gcc gcc-c++ make libjpeg-devel libpng-devel libtiff-devel zlib-devel
> yum group install -y "Development Tools"
>
>
> # Install Leptonica from Source
> wget http://www.leptonica.com/source/leptonica-1.75.3.tar.gz
> tar -zxvf leptonica-1.75.3.tar.gz
> cd leptonica-1.75.3
> ./autobuild
> ./configure
> make -j
> make install
> cd ..
> # Delete tar.gz file if you like
>
>
> # Sanity checks
> # check if libpng is installed: type "whereis libpng" and expect to see a
> # directory; a blank line is not good
> # check if leptonica is installed: type "ls /usr/local/include" and expect
> # to see "leptonica"
>
>
> # Install Tesseract from Source
> wget https://github.com/tesseract-ocr/tesseract/archive/4.0.0-beta.1.tar.gz
> tar -zxvf 4.0.0-beta.1.tar.gz
> cd tesseract-4.0.0-beta.1/
> ./autogen.sh
> PKG_CONFIG_PATH=/usr/local/lib/pkgconfig LIBLEPT_HEADERSDIR=/usr/local/include \
>     ./configure --with-extra-includes=/usr/local/include \
>     --with-extra-libraries=/usr/local/lib
> LDFLAGS="-L/usr/local/lib" CFLAGS="-I/usr/local/include" make -j
> make install
> ldconfig
> cd ..
> # Delete tar.gz file if you like
>
>
> # Download and install tesseract language files (Tesseract 4 traineddata
> # files)
> wget https://github.com/tesseract-ocr/tessdata/raw/master/osd.traineddata
> wget https://github.com/tesseract-ocr/tessdata/raw/master/equ.traineddata
> wget https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata
> wget https://github.com/tesseract-ocr/tessdata/raw/master/chi_sim.traineddata
> # download any other languages you like
> mv *.traineddata /usr/local/share/tessdata
>
>
> # Sanity check
> # check if tesseract is installed: type "tesseract --version" and expect
> # to see 1st line (tesseract), 2nd line (leptonica), 3rd line (libraries for
> # images)
>

Re: [tesseract-ocr] Unsure why tesseract isn't returning the correct text

2018-04-22 Thread ShreeDevi Kumar

Yes, please use the latest code from github master branch for building.
That way you will have all the bug fixes and updates.

On Sun 22 Apr, 2018, 2:42 AM 'DR' via tesseract-ocr, <
tesseract-ocr@googlegroups.com> wrote:

> I double checked, there seems to be a 4.0.0-beta.1 tag. I assume you
> installed that using git?
>
>
> On Saturday, April 21, 2018 at 2:40:20 PM UTC-6, zdenop wrote:
>>
>> Really? Did you check it before writing to forum?
>>
>> Zdenko
>>
>> 2018-04-21 22:25 GMT+02:00 'DR' via tesseract-ocr <
>> tesser...@googlegroups.com>:
>>
>>> Where can I find tesseract 4 beta? The github repo goes up to 4 alpha.
>>>
>>> On Saturday, April 21, 2018 at 2:21:49 PM UTC-6, zdenop wrote:

>>>> Time for upgrade?
>>>>
>>>> Zdenko
>>>>
>>>> 2018-04-21 22:14 GMT+02:00 'DR' via tesseract-ocr <
>>>> tesser...@googlegroups.com>:

> I'm using:
>
> tesseract 3.04.01
>  leptonica-1.73
>   libgif 5.1.2 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 :
> libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.1.0
>
>
>
> On Saturday, April 21, 2018 at 2:48:15 AM UTC-6, shree wrote:
>>
>>
>> BLAZIKEN-M RAPIDASH-M VICTREEBEL-M SHRRPEDO-M PORYGON-I-M  RAZELF-M
>>
>> with
>>
>>  tesseract -v
>> tesseract 4.0.0-beta.1-133-g5435c
>>  leptonica-1.76.0
>>   libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 : libtiff 4.0.3 :
>> zlib 1.2.8 : libopenjp2 2.3.0
>>  Found AVX
>>  Found SSE
>>
>> tesseract names.png - --tessdata-dir ./tessdata_best
>> Warning. Invalid resolution 0 dpi. Using 70 instead.
>> Estimating resolution as 547
>> BLAZIKEN-M RAPIDASH-M VICTREEBEL-M SHRRPEDO-M PORYGON-I-M  RAZELF-M
>>
>>
>> Which version of tesseract are you using?
>>
>> ShreeDevi
>> 
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Sat, Apr 21, 2018 at 6:32 AM, 'DR' via tesseract-ocr <
>> tesser...@googlegroups.com> wrote:
>>
>>> I have this image I want to turn into text:
>>>
>>>
>>> 
>>> To clean it up, I've used Fred's textcleaner script (
>>> http://www.fmwconcepts.com/imagemagick/textcleaner/index.php) and
>>> ran
>>>
>>> ./textcleaner -i 2 names.png result.png

>>>
>>> on the image, the result is now:
>>>
>>>
>>> 
>>> It looks a lot cleaner, so now I use tesseract to turn it into text:
>>>
>>> tesseract result.png stdout -psm 7 -l eng --user-words \
>>>     /path/to/eng.user-words --user-patterns /path/to/eng.user-patterns
>>>
>>>
>>> with the following files,  eng.user-words:
>>>
>>> BLAZIKEN
>>> RAPIDASH
>>> VICTREEBEL
>>> SHARPEDO
>>> PORYGON-Z
>>> AZELF
>>>
>>>
>>> eng.user-pattern:
>>>
>>> -M
>>>
>>>
>>> & /path/to/configs/bazaar:
>>>
>>> load_system_dawg     F
>>> load_freq_dawg       F
>>> user_words_suffix    user-words
>>> user_patterns_suffix user-patterns
>>>
>>>
>>> Yet my output is:
>>>
>>> Bl*H*ZIKEN-M R*H*PID*H*SH-M V*lE*TREEBEl-M SH*H*RPE*IIIJ*-M P*U*RY
>>> *Eﬂ*N-Z-M *H*ZELF-M
>>>
>>>
>>> Since case isn't an issue for me, the only problems are "A" showing
>>> up as "H", "C" showing up as "LE", "DO" showing up as "IIIJ", and "GO"
>>> showing up as "Efl" (with "fl" being one character).
>>>
>>> I'm not sure how to make the image any clearer if possible or if I'm
>>> doing something wrong with tesseract. Any help is appreciated.
>>>

Re: [tesseract-ocr] "jav" language -- is it Javanese Script or Latin-based text?

2018-04-22 Thread ShreeDevi Kumar

It seems to be in Latin script; see
https://github.com/tesseract-ocr/langdata/blob/master/jav/jav.training_text

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sun, Apr 22, 2018 at 2:58 PM, Christopher Imantaka Halim <
topher.halim...@gmail.com> wrote:

> Hi everyone,
>
> I'm new to Tesseract OCR, want to develop an OCR for Javanese Script /
> Aksara.
>
> Noticed that Tesseract 4.0 already have a "jav" language package:
>
> https://github.com/tesseract-ocr/tesseract/wiki/Data-Files
>
> The question, is it for Javanese Script or for Javanese in Latin text?
>
> https://en.wikipedia.org/wiki/Javanese_script
>
> Thanks before
>

Re: [tesseract-ocr] Unsure why tesseract isn't returning the correct text

2018-04-21 Thread ShreeDevi Kumar

BLAZIKEN-M RAPIDASH-M VICTREEBEL-M SHRRPEDO-M PORYGON-I-M  RAZELF-M

with

 tesseract -v
tesseract 4.0.0-beta.1-133-g5435c
 leptonica-1.76.0
  libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : libopenjp2 2.3.0
 Found AVX
 Found SSE

tesseract names.png - --tessdata-dir ./tessdata_best
Warning. Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 547
BLAZIKEN-M RAPIDASH-M VICTREEBEL-M SHRRPEDO-M PORYGON-I-M  RAZELF-M


Which version of tesseract are you using?

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sat, Apr 21, 2018 at 6:32 AM, 'DR' via tesseract-ocr <
tesseract-ocr@googlegroups.com> wrote:

> I have this image I want to turn into text:
>
>
> 
> To clean it up, I've used Fred's textcleaner script (
> http://www.fmwconcepts.com/imagemagick/textcleaner/index.php) and ran
>
> ./textcleaner -i 2 names.png result.png
>>
>
> on the image, the result is now:
>
>
> 
> It looks a lot cleaner, so now I use tesseract to turn it into text:
>
> tesseract result.png stdout -psm 7 -l eng --user-words /path/to/eng.user-words --user-patterns /path/to/eng.user-patterns
>
>
> with the following files,  eng.user-words:
>
> BLAZIKEN
> RAPIDASH
> VICTREEBEL
> SHARPEDO
> PORYGON-Z
> AZELF
>
>
> eng.user-pattern:
>
> -M
>
>
> & /path/to/configs/bazaar:
>
> load_system_dawg     F
> load_freq_dawg       F
> user_words_suffix    user-words
> user_patterns_suffix user-patterns
>
>
> Yet my output is:
>
> Bl*H*ZIKEN-M R*H*PID*H*SH-M V*lE*TREEBEl-M SH*H*RPE*IIIJ*-M P*U*RY*Eﬂ*N-Z-M *H*ZELF-M
>
>
> Since case isn't an issue for me, the only problems are "A" showing up as
> "H", "C" showing up as "LE", "DO" showing up as "IIIJ", and "GO" showing up
> as "Efl" (with "fl" being one character).
>
> I'm not sure how to make the image any clearer if possible or if I'm doing
> something wrong with tesseract. Any help is appreciated.
>

Re: [tesseract-ocr] Train Tesseract 4.0 on Windows 8

2018-04-19 Thread ShreeDevi Kumar

tesstrain.sh is a bash shell script. You don't need Python for it.

Try the following (give the correct path):

bash ./tesstrain.sh



ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Thu, Apr 19, 2018 at 8:01 PM,  wrote:

> I have installed the latest Tesseract 4.0 binary from UB Mannheim, along
> with Python, Git & Java, on my 64-bit Windows 8 machine.
> I am trying to run the "tesstrain.sh" script, but an error message appears.
> Any help?
>
>
> 
>

Re: [tesseract-ocr] How can I know whichever file format types Tesseract will recognize and able to process them ?

2018-04-18 Thread ShreeDevi Kumar

It depends on which image libraries Leptonica was built with. Running

tesseract -v

will show the list.
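As a rough sketch of reading that list, the helper below pulls the library names out of the version banner. The function name image_libs is made up for illustration, and it assumes the banner looks like the 4.0.0-beta output quoted elsewhere in this archive, with the libraries on one " : "-separated line; other builds may print it differently.

```shell
# image_libs: read `tesseract -v` output on stdin and print one library
# name per line. Assumes the libraries sit on a single " : "-separated line,
# as in the 4.0.0-beta banner; this layout is an assumption, not a guarantee.
image_libs() {
  grep ' : ' |                # keep only the line listing the libraries
    tr ':' '\n' |             # one library entry per line
    awk 'NF { print $1 }'     # first word of each entry is the library name
}

# Typical use (some versions print the banner on stderr, hence 2>&1):
#   tesseract -v 2>&1 | image_libs
```

Each name corresponds to an image format family: libjpeg for JPEG, libpng for PNG, libtiff for TIFF, libopenjp2 for JPEG 2000. A format whose library is missing from the list will most likely not be readable.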


On Thu 19 Apr, 2018, 10:46 AM abdu,  wrote:

> How do we find out which file types Tesseract is
> capable of processing?
>

Re: [tesseract-ocr] Training tessract 4.0 using images?

2018-04-15 Thread ShreeDevi Kumar

Please take a look at tesstrain_utils.sh and language-specific.sh in the
training directory for more details about how training works.

As mentioned before, training with box/tiff pairs is not officially supported.



On Mon 16 Apr, 2018, 8:19 AM ,  wrote:

> Hi Shree,
>
> Thanks for your help, I was able to successfully train with the boxfiles.
> Is it possible to not provide any font data at all? Theoretically, if I was
> training for a document that did not have any font data available on the
> web, what would I do then?
> In tesstrain.sh, after I copy the box tiff pairs into /tmp like you said,
> does the script still generate box-tiff pairs using font data? It seems
> that the lines that say
>
> phase_I_generate_image 8
> phase_UP_generate_unicharset
>
> serve this function. Is the script still relying on training data
> generated by font data? Sorry, I'm not clear on the entire process that
> tesstrain.sh uses.
>
> Thanks once again,
> Dennis
>
> On Sunday, April 15, 2018 at 1:55:16 AM UTC-7, shree wrote:
>>
>> Hi Dennis,
>>
>> 1. Copy 4.0 format box/tiff pairs to langdata/$lang directory or any
>> other folder of your choice.
>>
>> 2. Modify tesstrain.sh to copy these files to your /tmp directory - see
>> following for where the lines need to be added
>>
>>
>> source "$(dirname $0)/tesstrain_utils.sh"
>>
>> ARGV=("$@")
>> parse_flags
>>
>> mkdir -p ${TRAINING_DIR}
>> tlog "\n=== Starting training for language '${LANG_CODE}'"
>>
>> # copy box tiff pairs from langdata/lang directory #shree
>> cp ./langdata/${LANG_CODE}/*.tif "${TRAINING_DIR}/"  #shree
>> cp ./langdata/${LANG_CODE}/*.box "${TRAINING_DIR}/"  #shree
>> ls -l "${TRAINING_DIR}/"  #shree
>>
>> source "$(dirname $0)/language-specific.sh"
>> set_lang_specific_parameters ${LANG_CODE}
>>
>> 3. run tesstrain.sh with at least one font and sample training text to
>> use, in addition to the provided box/tiff pairs.
>>
>>
>>
>>
>>
>>
>>
>>
>> ShreeDevi
>> 
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Sun, Apr 15, 2018 at 12:36 PM,  wrote:
>>
>>> Hi shree,
>>>
>>> Thanks for your reply. Is there any option to use tesstrain.sh in
>>> tesseract 4.0 to generate the traineddata and lstm files using the image
>>> and boxfiles? Or do I still have to go through the process as listed in the
>>> Tesseract 3.0 instructions? In which case, I would be able to generate the
>>> traineddata file (and the unicharset file, I think), but not the lstm file.
>>> How can I generate this lstm file? Is there a tool I can use?
>>>
>>> Thanks again,
>>> Dennis
>>>
>>> On Friday, April 13, 2018 at 5:19:47 AM UTC-7, shree wrote:

>>>> Training Tesseract 4.0 from images is not officially supported.
>>>> Several people have had success doing LSTM training with box/tiff pairs,
>>>> but it requires hacks/programming on their part to create
>>>> 4.0.0-compatible box files.
>>>>
>>>> tesstrain.sh creates box/tiff files in the /tmp directory; these are
>>>> used to create the lstmf files for LSTM training. tesstrain.sh can create
>>>> a 3.0x-compatible traineddata or a 4.0.0-compatible starter traineddata,
>>>> depending on the options chosen. For 4.0.0, this starter traineddata is
>>>> used along with the lstmf files for LSTM training.
>>>>
>>>> The format of traineddata files for 3.0x and 4.0.0 is different.
>>>>
>>>> For the different components of a traineddata file, see
>>>>
>>>> https://github.com/tesseract-ocr/tesseract/blob/master/doc/combine_tessdata.1.asc
>>>>
>>>> For creating 4.0-compatible box files, see
>>>>
>>>> https://github.com/tesseract-ocr/langdata/issues/83#issuecomment-375247341
>>>>
>>>> https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LSTM#training-tesseract-lstm-engine
>>>>
>>>> Please note that all these are unsupported options.
>>>>
>>>> ShreeDevi
>>>>
>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>
>>>> On Fri, Apr 13, 2018 at 12:09 PM,  wrote:

> Hi all,
>
> I read in a different post that training Tesseract 4.0 from images is
> not supported, is this true? I have been able to successfully train
> Tesseract 4.0 so far using font data. When using tesstrain.sh, the script
> creates a number of files, including an lstmf file alongside the usual
> trainedata file (and there are some others like unicharset). I was
> wondering if it is possible to use the traineddata generation from image
> and boxfile described in the Tesseract 3.0 training instructions to create
> these training files to train Tesseract 4.0. Tesseract 3.0 instructions
> already produce a traineddata file, how can I generate the lstmf file (and
> the others) if it is possible?
>
> Thank you,
> Dennis
>

Re: [tesseract-ocr] Training tessract 4.0 using images?

2018-04-15 Thread ShreeDevi Kumar

Hi Dennis,

1. Copy the 4.0-format box/tiff pairs to the langdata/$lang directory, or to
any other folder of your choice.

2. Modify tesstrain.sh to copy these files to your /tmp directory. The
following excerpt shows where the lines need to be added:


source "$(dirname $0)/tesstrain_utils.sh"

ARGV=("$@")
parse_flags

mkdir -p ${TRAINING_DIR}
tlog "\n=== Starting training for language '${LANG_CODE}'"

# copy box/tiff pairs from the langdata/lang directory #shree
cp ./langdata/${LANG_CODE}/*.tif "${TRAINING_DIR}/"  #shree
cp ./langdata/${LANG_CODE}/*.box "${TRAINING_DIR}/"  #shree
ls -l "${TRAINING_DIR}/"  #shree

source "$(dirname $0)/language-specific.sh"
set_lang_specific_parameters ${LANG_CODE}

3. Run tesstrain.sh with at least one font and a sample training text, in
addition to the provided box/tiff pairs.
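The two cp lines in step 2 copy every .tif and every .box, whether or not each image has a partner. A slightly more defensive variant might look like the sketch below. The helper name copy_pairs is hypothetical and is not part of tesstrain.sh; it only illustrates checking for complete pairs before copying.

```shell
# copy_pairs SRC DST: copy only complete .tif/.box pairs from SRC to DST,
# warn about images without a box file, and print the number of pairs copied.
# Hypothetical helper for illustration; not part of the tesseract scripts.
copy_pairs() (
  src="$1"; dst="$2"; copied=0
  mkdir -p "$dst"
  for tif in "$src"/*.tif; do
    [ -e "$tif" ] || continue            # glob matched nothing
    box="${tif%.tif}.box"
    if [ -f "$box" ]; then
      cp "$tif" "$box" "$dst"/ && copied=$((copied + 1))
    else
      echo "missing box file for $tif" >&2
    fi
  done
  echo "$copied"
)
```

Inside the modified tesstrain.sh, this could be called as copy_pairs "./langdata/${LANG_CODE}" "${TRAINING_DIR}" in place of the two cp lines.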








ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sun, Apr 15, 2018 at 12:36 PM,  wrote:

> Hi shree,
>
> Thanks for your reply. Is there any option to use tesstrain.sh in
> tesseract 4.0 to generate the traineddata and lstm files using the image
> and boxfiles? Or do I still have to go through the process as listed in the
> Tesseract 3.0 instructions? In which case, I would be able to generate the
> traineddata file (and the unicharset file, I think), but not the lstm file.
> How can I generate this lstm file? Is there a tool I can use?
>
> Thanks again,
> Dennis
>
> On Friday, April 13, 2018 at 5:19:47 AM UTC-7, shree wrote:
>>
>> Training Tesseract 4.0 from images is not officially supported. Several
>> people have had success doing LSTM training with box/tiff pairs, but it
>> requires hacks/programming on their part to create 4.0.0-compatible box
>> files.
>>
>> tesstrain.sh creates box/tiff files in the /tmp directory; these are used
>> to create the lstmf files for LSTM training. tesstrain.sh can create a
>> 3.0x-compatible traineddata or a 4.0.0-compatible starter traineddata,
>> depending on the options chosen. For 4.0.0, this starter traineddata is
>> used along with the lstmf files for LSTM training.
>>
>> The format of traineddata files for 3.0x and 4.0.0 is different.
>>
>> For the different components of a traineddata file, see
>>
>> https://github.com/tesseract-ocr/tesseract/blob/master/doc/combine_tessdata.1.asc
>>
>> For creating 4.0-compatible box files, see
>>
>> https://github.com/tesseract-ocr/langdata/issues/83#issuecomment-375247341
>>
>> https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LSTM#training-tesseract-lstm-engine
>>
>> Please note that all these are unsupported options.
>>
>>
>> ShreeDevi
>> 
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Fri, Apr 13, 2018 at 12:09 PM,  wrote:
>>
>>> Hi all,
>>>
>>> I read in a different post that training Tesseract 4.0 from images is
>>> not supported, is this true? I have been able to successfully train
>>> Tesseract 4.0 so far using font data. When using tesstrain.sh, the script
>>> creates a number of files, including an lstmf file alongside the usual
>>> trainedata file (and there are some others like unicharset). I was
>>> wondering if it is possible to use the traineddata generation from image
>>> and boxfile described in the Tesseract 3.0 training instructions to create
>>> these training files to train Tesseract 4.0. Tesseract 3.0 instructions
>>> already produce a traineddata file, how can I generate the lstmf file (and
>>> the others) if it is possible?
>>>
>>> Thank you,
>>> Dennis
>>>

Re: [tesseract-ocr] Training tessract 4.0 using images?

2018-04-13 Thread ShreeDevi Kumar

Training Tesseract 4.0 from images is not officially supported. Several
people have had success doing LSTM training with box/tiff pairs, but it
requires hacks/programming on their part to create 4.0.0-compatible box
files.

tesstrain.sh creates box/tiff files in the /tmp directory; these are used
to create the lstmf files for LSTM training. tesstrain.sh can create a
3.0x-compatible traineddata or a 4.0.0-compatible starter traineddata,
depending on the options chosen. For 4.0.0, this starter traineddata is
used along with the lstmf files for LSTM training.
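To make the box/tiff to lstmf step concrete, here is a minimal sketch that prints one tesseract call per image with a matching box file. It relies on two assumptions that are not stated in this thread: that the lstm.train config ships with your tesseract install, and that --psm 6 suits your images. The function name make_lstmf_cmds is made up, and like everything else here this is an unsupported route.

```shell
# make_lstmf_cmds DIR: print a tesseract command for every DIR/*.tif that
# has a matching .box file. Each command is expected to write BASE.lstmf.
# Assumptions: the `lstm.train` config exists in your install and --psm 6
# fits the images; paths must not contain spaces if the output is piped to sh.
make_lstmf_cmds() (
  dir="$1"
  for tif in "$dir"/*.tif; do
    [ -e "$tif" ] || continue            # glob matched nothing
    base="${tif%.tif}"
    if [ -f "$base.box" ]; then
      printf 'tesseract %s %s --psm 6 lstm.train\n' "$tif" "$base"
    else
      printf 'skipping %s (no box file)\n' "$tif" >&2
    fi
  done
)
```

Printing the commands first makes it easy to review them. Once they look right, something like make_lstmf_cmds ./langdata/eng | sh would run them.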

The format of traineddata files for 3.0x and 4.0.0 is different.

For the different components of a traineddata file, see

https://github.com/tesseract-ocr/tesseract/blob/master/doc/combine_tessdata.1.asc

For creating 4.0 compatible box files see

https://github.com/tesseract-ocr/langdata/issues/83#issuecomment-375247341

https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LSTM#training-tesseract-lstm-engine

Please note that all these are unsupported options.


ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Fri, Apr 13, 2018 at 12:09 PM,  wrote:

> Hi all,
>
> I read in a different post that training Tesseract 4.0 from images is not
> supported, is this true? I have been able to successfully train Tesseract
> 4.0 so far using font data. When using tesstrain.sh, the script creates a
> number of files, including an lstmf file alongside the usual trainedata
> file (and there are some others like unicharset). I was wondering if it is
> possible to use the traineddata generation from image and boxfile described
> in the Tesseract 3.0 training instructions to create these training files
> to train Tesseract 4.0. Tesseract 3.0 instructions already produce a
> traineddata file, how can I generate the lstmf file (and the others) if it
> is possible?
>
> Thank you,
> Dennis
>

Re: [tesseract-ocr] Re: Change unicharset

2018-04-12 Thread ShreeDevi Kumar

1. concatenate the two training texts

cat ./langdata/kor/kor.training_text \
    ./langdata/chi_tra/chi_tra.training_text \
    > ./langdata/kor/kor-chi_tra.training_text


2. Run tesstrain.sh as follows (update the paths for your setup; as a test, run
with just one font that supports both languages)

$tesstrain_dir/tesstrain.sh \
   --lang kor \
   --linedata_only \
   --noextract_font_properties \
   --exposures "0" \
   --fonts_dir /usr/share/fonts/ \
   --fontlist "Arial" \
   --langdata_dir ./langdata \
   --tessdata_dir  ./tessdata_best \
   --training_text  ./langdata/kor/kor-chi_tra.training_text \
   --output_dir $train_output_dir

3.  Check the unicharset in the generated starter traineddata

 $train_output_dir/kor/kor.unicharset

This should have unichars from both languages.

4. Concatenate the two wordlists

cat ./langdata/kor/kor.wordlist ./langdata/chi_tra/chi_tra.wordlist \
    > ./langdata/kor/kor-chi_tra.wordlist

5. Extract the LSTM model from the existing traineddata

combine_tessdata -e ./tessdata_best/kor.traineddata \
    $train_output_dir/kor.lstm

etc
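For the check in step 3, a small helper like the one below can confirm that particular characters made it into the unicharset. This is only a sketch. It assumes the usual unicharset layout, where the first line is a count and every later line starts with the unichar itself, and the function name unicharset_has is made up for illustration.

```shell
# unicharset_has FILE CHAR: exit 0 if CHAR appears as a unichar in FILE.
# Assumes line 1 is the entry count and each later line begins with the
# unichar followed by whitespace. CHAR must not contain backslashes,
# because awk -v interprets escape sequences.
unicharset_has() {
  awk -v c="$2" 'NR > 1 && $1 == c { found = 1 } END { exit !found }' "$1"
}

# Example: test one Hangul and one Han character after step 2.
#   unicharset_has "$train_output_dir/kor/kor.unicharset" '한' && echo "Hangul ok"
#   unicharset_has "$train_output_dir/kor/kor.unicharset" '文' && echo "Han ok"
```

If the Han character is missing, the font or training text used in step 2 probably did not cover it.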



ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Fri, Apr 13, 2018 at 10:35 AM, Fanatico  wrote:

> And if I look at the "kor.unicharset" created after executing
> "training/tesstrain.sh", it only contains the Korean characters, even after
> I changed "kor.lstm-unicharset" in "kor.traineddata".
>

Re: [tesseract-ocr] Change unicharset

2018-04-12 Thread ShreeDevi Kumar

You cannot just overwrite the lstm-unicharset in a traineddata file. The
unicharset has to be in sync with the other components in it, i.e. lstm, dawgs,
recoder, etc.

> I'm merging the ```kor.training_text``` with the
> ```chi_tra.training_text``` for tests

You need to go through the complete training process after this. Only then
will both sets of characters be reflected in it.

You can try "add a layer" training, using tessdata_best/kor.traineddata as
the model to continue from.

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Fri, Apr 13, 2018 at 7:51 AM, Fanatico  wrote:

> I'm trying to add Chinese to my Korean charset, but I'm not able to do it.
>
> Obs.: Since Korean can use some Chinese characters (hanja) I'm merging the
> ```kor.training_text``` with the ```chi_tra.training_text``` for tests
>
> Reference:
> https://en.wikipedia.org/wiki/Hanja
> https://www.howtostudykorean.com/hanja-unit-1-lessons-1-20/hanja-lesson-1/
>
> I tried to use:
> combine_tessdata -u ~/projects/tessdata_best/kor.traineddata \
>   ~/projects/ocr/tmp/kor.
> combine_tessdata -o ~/projects/tesseract/tessdata/kor.traineddata \
>   ~/projects/ocr/tmp/kor.lstm-unicharset
>
> I tried to use this line on "training/tesstrain.sh":
> --wordlist ~/projects/ocr/training/kortrain/kor.wordlist \
>
> and I tried to use this line in the "kor.config" file
> tessedit_load_sublangs chi_tra
>
>
> But all these failed, if I run "training/tesstrain.sh" and go to the
> "kor/kor.unicharset" file, it only contains the Korean charset and I get
> errors like these:
> Other case Ｌ of ｌ is not in unicharset
> Mirror 〔 of 〕 is not in unicharset
> Mirror 】 of 【 is not in unicharset
> Mirror ［ of ］ is not in unicharset
> Mirror 「 of 」 is not in unicharset
> Setting script properties
> Warning: properties incomplete for index 71 = ｌ
> Warning: properties incomplete for index 153 = ，
> Warning: properties incomplete for index 182 = ？
> Warning: properties incomplete for index 313 = １
> Warning: properties incomplete for index 314 = ０
> Warning: properties incomplete for index 368 = ５
> Warning: properties incomplete for index 579 = ］
> Warning: properties incomplete for index 720 = －
> Warning: properties incomplete for index 918 = ２
> Warning: properties incomplete for index 941 = ￥
> Warning: properties incomplete for index 969 = ＆
> Config file is optional, continuing...
> Null char=2
>
> If I run a test with "training/lstmeval" on data that has Chinese and Korean
> characters:
> ~/projects/tesseract/training/lstmeval \
>   --model ~/projects/tesseract/tessdata/kor.traineddata \
>   --eval_listfile ~/projects/ocr/training/kortrain/eval/kor.training_files.txt
>
> I get a lot of these errors:
> Can't encode transcription: '文章輯旭攝影會員肥功能 桐獎功能 時可以麂榻榻米(瘋狂using 辛亥道具' in
> language ''
> Encoding of string failed! Failure bytes: ffe6 ffa0 ffb4
> ffe6 ffaa ff80 ffe6 ffbd ff98 ffe7 ff9f
> ffb3 ffe5 ffb1 ffb9 ffe5 ffaf ffba ffe5
> ffbb ff9f ffe5 ffb3 ffbb 20 ffe7 ffa7 ff92
> ffe4 ffb8 ff89 ffe8 ff89 ffb2 ffe8 ff8f
> ffab 20 ffe6 ff98 ff9f ffe6 ff9c ff9f ffe4
> ffba ff94 ffe5 ff98 ffa7 43 44 ffe4 ffbd
> ffbf ffe7 ff94 ffa8 ffe6 ffb4 ffaa ffe7
> ff91 ff9e ffe9 ff9c ff99 ffe6 ff85 ffb3
> ffe5 ff8d ff94 ffe8 ffad ffb0 20 ffe6 ff84
> ff9f ffe5 ff98 ff86 32 37 ffe6 ff92 ffb3 20
> ffe6 ffb1 ff95 ffe5 ffb0 ffbe
> Can't encode transcription: '栴檀潘石屹寺廟峻 秒三色菫 星期五嘧CD使用洪瑞霙慳協議 感嘆27撳 汕尾' in
> language ''
> Encoding of string failed! Failure bytes: ffe5 ffad ffa2
> ffe5 ffad ff90 4c 56 20 ffe6 ffb7 ffb1 ffe5
> ff9c ffb3 20 ffe5 ff92 ff96 ffe5 ff95 ffa1
> 20 ffe4 ffb8 ff8a ffe7 ffb7 ff9a 20 ffe6
> ffa6 ffab 20 ffe9 ff83 ffad ffe6 ffb3 ff93
> ffe5 ffbf ff97 ffe6 ff92 ffac 20 28 ffe6
> ffb0 ff91 ffe5 ff9c ff8b ffe6 ff9b ff86 20
> ffe6 ffb7 ffa4 ffe7 ffa9 ff8d 47 55 43 43 49 30 38
> ffe5 ff87 ffba ffe6 ff88 ff96 ffe8 ff80
> ff85 ffe6 ff94 ffbf 7c 68 61 73
> Can't encode transcription: '孢子LV 深圳 咖啡 上線 榫 郭泓志撬 (民國曆 淤積GUCCI08出或者政|has'
> in language ''
> Encoding of string failed! Failure bytes: ffe5 ff88 ff97
> ffe8 ffa1 ffa8 ffe7 ff9a ff84 ffe3 ff80
> ff8f ffe9 ff86 ff8d ffe9 ff86 ff90 20 2d
> ffe4 ffb8 ff80 ffe5 ff85 ffb6 ffe9 ffa4
> ff98 ffe6 ffb3 ff95 ffe5 ff8b ff99 37 36 38
> ffe4

Re: [tesseract-ocr] Column splitting failed around fuzzy line

2018-04-11 Thread ShreeDevi Kumar

Take a look at the Leptonica sample programs on column splitting to see whether
you can preprocess the image better before passing it to Tesseract.


On Wed 11 Apr, 2018, 11:46 AM Ewan Mellor,  wrote:

> Hi,
>
>
> I am using Tesseract 4 (git 10f4998a) to process a file with two columns.
> A snippet of the image is shown below.  The problem is that there is a
> fuzzy line between the two columns, and the column detector has got
> confused.  I've ended up with one block covering the first column up to
> "The" on the second line, but then a block covering both columns with the
> "patient has ..." all the way across to "history of low".
>
>
> I've looked in the debug views, and it looks to me like the line removal
> hasn't managed to remove that fuzzy line down the middle.  The "good" is
> then close enough that the column finder is deciding to merge the two
> blocks on that line.
>
>
> Looking at the code in linefind.cpp and colfind.cpp, I see lots of
> constants for various thresholds, but I don't see any configurable ones,
> and I'm not sure which way to go now.  Would it be better to work on the
> line detector in linefind.cpp and try and get rid of that vertical line?
> Or would I be better to run a columnar histogram and try and do column
> splitting myself?  Or should I ignore the fact that the line hasn't been
> removed, and concentrate on tightening up the column finder so that it's
> able to separate these two columns correctly?  It seems to me that there's
> enough of a gap there that it ought to be able to separate the columns (it
> does a pretty good job on the rest of the document, so it can't be far off).
>
>
> Any recommendations would be appreciated.
>
>
> Thanks,
>
>
> Ewan.
>
>
>
>
>
> 
>
Re: [tesseract-ocr] Re: Error opening traineddata files on Mac High Sierra

2018-04-11 Thread ShreeDevi Kumar

Regarding PDF output, see

https://github.com/tesseract-ocr/tesseract/issues/660

On Wed 11 Apr, 2018, 1:28 PM ShreeDevi Kumar, <shreesh...@gmail.com> wrote:

> 1. Check the output tif and adjust convert command if needed
>
> 2. Depending on your tesseract version you could try -l frk also.
>
> 3. Yes, you can get a pdf as output.
>
> Search Github issues, there is a long discussion thread regarding best
> ways to create a pdf output.
>
> Look for pdf and invisible pdf.
>
> On Wed 11 Apr, 2018, 1:03 PM Firlefanz, <firlefanze...@gmail.com> wrote:
>
>>
>> It works! I am so relieved. Thank you all for the help.
>>
>> Still I have a couple of questions since I've read a couple of tutorials,
>> each using other commands:
>>
>> 1. To convert my Fraktur pdf files to tiff I use imagemagick. Is this the
>> right command? convert -density 300 test.pdf -depth 8 -strip -background
>> white -alpha off test.tiff
>>
>> 2. Then for tesseract, the command: tesseract test.tiff outtest -l deu_frak
>> With this I get a txt version of the tiff.
>>
>> 3. Not that it matters too much (I'm over the moon that it works like
>> this), but instead of a txt, can I get the original pdf as output, just
>> with a search-and-copy function?
>>
>>
>


Re: [tesseract-ocr] Re: Error opening traineddata files on Mac High Sierra

2018-04-11 Thread ShreeDevi Kumar

1. Check the output tif and adjust convert command if needed

2. Depending on your tesseract version you could try -l frk also.

3. Yes, you can get a pdf as output.

Search Github issues, there is a long discussion thread regarding best ways
to create a pdf output.

Look for pdf and invisible pdf.
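
As a rough illustration of point 3 (a sketch, not taken from the thread):
tesseract accepts a config name after the options, and `pdf` is the one that
produces a searchable PDF. The Python helper below only builds the command
line; the function name `tesseract_pdf_command` is made up for this example,
and the file names are the ones from the question.

```python
import shlex


def tesseract_pdf_command(image, output_base, lang):
    """Build a tesseract invocation that writes <output_base>.pdf."""
    # The trailing "pdf" is a config name, not an option.
    return ["tesseract", image, output_base, "-l", lang, "pdf"]


if __name__ == "__main__":
    cmd = tesseract_pdf_command("test.tiff", "outtest", "deu_frak")
    # Print a shell-quoted version that can be pasted into a terminal.
    print(shlex.join(cmd))
```

The list form can be passed to `subprocess.run` directly if you prefer to
call tesseract from a script.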

On Wed 11 Apr, 2018, 1:03 PM Firlefanz,  wrote:

>
> It works! I am so relieved. Thank you all for the help.
>
> Still I have a couple of questions since I've read a couple of tutorials,
> each using other commands:
>
> 1. To convert my Fraktur pdf files to tiff I use imagemagick. Is this the
> right command? convert -density 300 test.pdf -depth 8 -strip -background
> white -alpha off test.tiff
>
> 2. Then for tesseract, the command: tesseract test.tiff outtest -l deu_frak
> With this I get a txt version of the tiff.
>
> 3. Not that it matters too much (I'm over the moon that it works like
> this), but instead of a txt, can I get the original pdf as output, just
> with a search-and-copy function?
>
>


Re: [tesseract-ocr] Re: Doubt on "--eval_listfile"

2018-04-10 Thread ShreeDevi Kumar

Yes, and you can use different text files for training and eval.



ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, Apr 10, 2018 at 10:01 PM, Fanatico  wrote:

> When I asked about passing the ".training_text" as a param, I meant when
> creating the training data with "training/tesstrain.sh".
>
> On Tuesday, 10 April 2018 13:30:05 UTC-3, Fanatico wrote:
>>
>> I just had a thought: can I pass only the ".training_text" file as a
>> param, like --training_text?
>>


Re: [tesseract-ocr] Doubt on "--eval_listfile"

2018-04-10 Thread ShreeDevi Kumar

To make sure that the model is not overfitted to training data, your eval
set should be different.

You can use a different text file and different fonts from the training set
to check that the model performs well on text and fonts it has not seen
before.
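
One way to keep the two sets apart, sketched in Python: hold out whole fonts
for evaluation when building the two list files. This assumes the .lstmf
files follow the `<lang>.<font>.exp<N>.lstmf` naming that tesstrain.sh
produces; the function names and the font names are made up for the example.

```python
from pathlib import PurePosixPath


def font_of(lstmf_path):
    """Return the font part of a name like 'kor.FontA.exp0.lstmf'.

    Assumes the '<lang>.<font>.exp<N>.lstmf' naming convention.
    """
    parts = PurePosixPath(lstmf_path).name.split(".")
    # parts[0] is the language; the last two parts are 'expN' and 'lstmf'.
    return ".".join(parts[1:-2])


def split_by_font(lstmf_paths, eval_fonts):
    """Split paths into (train, eval) lists, holding out whole fonts."""
    train, held_out = [], []
    for path in lstmf_paths:
        (held_out if font_of(path) in eval_fonts else train).append(path)
    return train, held_out


if __name__ == "__main__":
    files = [
        "train/kor.FontA.exp0.lstmf",
        "train/kor.FontB.exp0.lstmf",
        "train/kor.FontD.exp0.lstmf",
    ]
    train, held_out = split_by_font(files, eval_fonts={"FontD"})
    # Each list would be written, one path per line, to the files given
    # to --train_listfile and --eval_listfile respectively.
    print("\n".join(train))
    print("\n".join(held_out))
```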

On Tue 10 Apr, 2018, 8:16 PM Fanatico,  wrote:

> Platform: MAC OS X
> Tesseract: 4.0.0-beta.1-69-g10f4
>
> When I execute a command like:
>
> SCROLLVIEW_PATH=~/projects/tesseract/java \
>   ~/projects/tesseract/training/lstmtraining \
>   --debug_interval 100 \
>   --continue_from ~/projects/ocr/training/kortrain/kor_from_full/kor.lstm \
>   --traineddata ~/projects/ocr/training/kortrain/new_train/kor/kor.traineddata \
>   --append_index 5 \
>   --net_spec '[Lfx256 O1c111]' \
>   --model_output ~/projects/ocr/training/kortrain/kor_from_full/base \
>   --train_listfile ~/projects/ocr/training/kortrain/new_train/kor.training_files.txt \
>   --eval_listfile ~/projects/ocr/training/kortrain/eval/kor.training_files.txt \
>   --target_error_rate 1 \
>   &>~/projects/ocr/training/kortrain/kor_from_full/basetrain.log
>
> "--train_listfile" gives the location of my training files for each font,
> and "--eval_listfile" is, I suppose, the location of the training files
> used to test the result of the training.
>
> So my questions are:
> 1 - Why would I train with the fonts "A", "B" and "C" but test with the
> fonts "D", "E" and "F"?
> 2 - And if I need to test using the same fonts, then why do I need to pass
> the same file twice?
>


Re: [tesseract-ocr] How to created training text as provided in langdata for any new language if i have just just have a wordlist.

2018-04-09 Thread ShreeDevi Kumar

For tesseract 3.05:

Random text will work, but it is suggested to use combinations similar to
those in the English training text.

It is unlikely you will get answers to your questions from the developers.
You can search past issues/questions in the forum and on github.

3.05 training does not take long, so run a few experiments for your
'language' and test.
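
A quick way to check a candidate training text against the sample counts the
wiki suggests (10 per character is good, 5 is OK for rare ones), sketched in
Python. The function name and the default threshold are only for
illustration; nothing in this thread says tesseract itself enforces these
numbers.

```python
from collections import Counter


def undersampled_chars(text, minimum=10):
    """Map each non-whitespace character seen fewer than `minimum` times
    to the number of times it occurs in `text`."""
    counts = Counter(ch for ch in text if not ch.isspace())
    return {ch: n for ch, n in counts.items() if n < minimum}


if __name__ == "__main__":
    sample = "aaaaaaaaaa bbbbb c"
    # List the characters below the threshold, rarest first.
    for ch, n in sorted(undersampled_chars(sample).items(), key=lambda kv: kv[1]):
        print(f"{ch!r}: {n}")
```

Running it over each experimental training text before generating images
shows which characters still need more samples.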

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Mon, Apr 9, 2018 at 2:15 PM, Romil Mehla <meh...@gmail.com> wrote:

> Hi Shree, thanks for replying.
>
> For tesseract *3.05.00*
>
> I had already checked that link; it says:
> *"Make sure there are a minimum number of samples of each character. 10 is
> good, but 5 is OK for rare characters.*
> *There should be more samples of the more frequent characters - at least
> 20.*
> *Don't make the mistake of grouping all the non-letters together. Make the
> text more realistic"*
>
> Does this hold for the langdata eng.training_text? If yes, that means they
> are generating it randomly. How can randomly generated training text
> assure accuracy?
> They also mention that each character should have a minimum of 10 samples.
> Why is that, and where in the code is this criterion used? I have checked
> the code but could not find it anywhere. Is it related to an algorithm? If
> so, which one: the adaptive or shape classifier, or something related to
> bounding box coordinates?
>
> Please clear my doubts and, if required, please pull in Ray or someone from
> the dev team as well, as I also have doubts regarding the tesseract code.
> I could not post in the tesseract-dev forum because questions should be
> asked on the tesseract user list only.
>
> Then how can I have a tesseract developer answer my question? Please tell
> me the way.
>
> Thanks again for your timely reply and help.
>
>
>
>
> On Sat, Apr 7, 2018 at 6:21 PM, ShreeDevi Kumar <shreesh...@gmail.com>
> wrote:
>
>> see https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-3.03%E2%80%933.05
>>
>> ShreeDevi
>> 
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Sat, Apr 7, 2018 at 4:02 PM, Romil Mehla <meh...@gmail.com> wrote:
>>
>>> Thanks for your reply. I have read about tesseract 4.0, and Ray
>>> mentioned how he used many files to train it, but I don't want to use
>>> tesseract 4.0; I want to know about tesseract 3.05.00. From my
>>> understanding, for the eng language, say, the eng.training_text file is
>>> built from the eng.wordlist file in langdata. For a new language, how can
>>> I build training text from my new language's wordlist? Any idea who
>>> created the eng.training_text file? Is there a rule or algorithm for
>>> doing so, or is it randomly generated from eng.wordlist while keeping at
>>> least 10 occurrences of each character in the training text?
>>>
>>>
>>>
>>> Please clarify this and let me know how to generate the training_text.
>>>
>>> On Saturday, April 7, 2018 at 3:46:10 PM UTC+5:30, shree wrote:
>>>>
>>>> Just a word list is not enough for training text.
>>>>
>>>> For tesseract 4.0.0 it needs to be representative of the text to be
>>>> recognized.
>>>>
>>>> On Sat 7 Apr, 2018, 2:50 PM Romil Mehla, <meh...@gmail.com> wrote:
>>>>
>>>>> Is there any program to generate it? I see ambiguous_words.cpp
>>>>> generating dictionary words and ambiguous words; where is it used? Or
>>>>> can it be used to build the unicharambigs file to generate rules?
>>>>>

Re: [tesseract-ocr] Tessercat 4.0 korean detecting chinese

2018-04-09 Thread ShreeDevi Kumar

For Korean, please check whether adding the following lines to the config
improves your results further.

#Fixes https://github.com/tesseract-ocr/tesseract/issues/1009
preserve_interword_spaces 1


ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Mon, Apr 9, 2018 at 1:45 PM, ShreeDevi Kumar <shreesh...@gmail.com>
wrote:

> Leftover from 3.04, my guess.
>
> On Mon 9 Apr, 2018, 12:52 PM Fanatico, <fanatico.s...@gmail.com> wrote:
>
>> It worked, thanks.
>>
>> Any reason for this chi_tra there?
>>
>>
>> On Monday, 9 April 2018 03:24:44 UTC-3, shree wrote:
>>>
>>> Please remove the sub language line from the config file, and use
>>> combine_tessdata to overwrite it.
>>>
>>> Right now it seems to be using chi_tra also.
>>>
>>> On Mon 9 Apr, 2018, 11:48 AM Fanatico, <fanati...@gmail.com> wrote:
>>>
>>>> I used a traineddata that I created by removing the top layer from
>>>> the kor.traineddata from "tessdata_best"; after this I replaced it
>>>> with the one from "tessdata_best" and got the same problem.
>>>>
>>>> Yes, it includes chi_tra as a sublanguage:
>>>> tessedit_load_sublangs chi_tra
>>>>
>>>> lstm-unicharset only has Korean characters
>>>>
>


Re: [tesseract-ocr] Tessercat 4.0 korean detecting chinese

2018-04-09 Thread ShreeDevi Kumar

Leftover from 3.04, my guess.

On Mon 9 Apr, 2018, 12:52 PM Fanatico,  wrote:

> It worked, thanks.
>
> Any reason for this chi_tra there?
>
>
> On Monday, 9 April 2018 03:24:44 UTC-3, shree wrote:
>>
>> Please remove the sub language line from the config file, and use
>> combine_tessdata to overwrite it.
>>
>> Right now it seems to be using chi_tra also.
>>
>> On Mon 9 Apr, 2018, 11:48 AM Fanatico,  wrote:
>>
>>> I used a traineddata that I created by removing the top layer from the
>>> kor.traineddata from "tessdata_best"; after this I replaced it with the
>>> one from "tessdata_best" and got the same problem.
>>>
>>> Yes, it includes chi_tra as a sublanguage:
>>> tessedit_load_sublangs chi_tra
>>>
>>> lstm-unicharset only has Korean characters
>>>


Re: [tesseract-ocr] Tessercat 4.0 korean detecting chinese

2018-04-09 Thread ShreeDevi Kumar

Please remove the sub language line from the config file, and use
combine_tessdata to overwrite it.

Right now it seems to be using chi_tra also.
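
A minimal sketch of the config edit in Python, assuming the config has
already been extracted to a text file (for example with
`combine_tessdata -e kor.traineddata kor.config`) and will be written back
afterwards with `combine_tessdata -o kor.traineddata kor.config`. The
function name `drop_sublangs` is hypothetical.

```python
def drop_sublangs(config_text):
    """Return config_text without any 'tessedit_load_sublangs' lines."""
    kept = [
        line
        for line in config_text.splitlines(keepends=True)
        # Compare the first whitespace-separated token of each line;
        # blank lines give an empty list and are kept unchanged.
        if line.split()[:1] != ["tessedit_load_sublangs"]
    ]
    return "".join(kept)


if __name__ == "__main__":
    original = "tessedit_load_sublangs chi_tra\npreserve_interword_spaces 1\n"
    print(drop_sublangs(original), end="")
```

All other config lines are passed through untouched, so the rest of the
traineddata behaves as before.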

On Mon 9 Apr, 2018, 11:48 AM Fanatico,  wrote:

> I used a traineddata that I created by removing the top layer from the
> kor.traineddata from "tessdata_best"; after this I replaced it with the
> one from "tessdata_best" and got the same problem.
>
> Yes, it includes chi_tra as a sublanguage:
> tessedit_load_sublangs chi_tra
>
> lstm-unicharset only has Korean characters
>


Re: [tesseract-ocr] Tessercat 4.0 korean detecting chinese

2018-04-08 Thread ShreeDevi Kumar

Which traineddata are you using?

Use combine_tessdata to extract the config file and see if Chinese is
included as a sub language.

Also look at the lstm-unicharset to see if the Chinese characters are
included in it.
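
For the unicharset check, a rough Python sketch that lists entries
containing Han (Chinese) characters. It assumes the extracted
lstm-unicharset is a text file whose first line is the entry count and whose
later lines each start with the unichar followed by space-separated
properties; it only tests the main CJK Unified Ideographs block, and the
function names are made up for the example.

```python
def is_han(ch):
    """True for the main CJK Unified Ideographs block (U+4E00..U+9FFF)."""
    return "\u4e00" <= ch <= "\u9fff"


def han_entries(unicharset_text):
    """List unichars (first field of each entry line) containing Han characters.

    Assumes the first line is the entry count and every later line starts
    with the unichar, followed by a space and its properties.
    """
    found = []
    for line in unicharset_text.splitlines()[1:]:
        unichar = line.split(" ")[0]
        if any(is_han(ch) for ch in unichar):
            found.append(unichar)
    return found


if __name__ == "__main__":
    # Simplified, made-up entries; real property fields are longer.
    sample = "3\nNULL 0 Common 0\n화 1 Hangul 1\n苗 1 Han 2\n"
    print(han_entries(sample))
```

An empty result suggests the Chinese output is coming from somewhere other
than the LSTM unicharset, for instance a sub language in the config.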

On Mon 9 Apr, 2018, 11:09 AM Fanatico,  wrote:

> I'm running tesseract with the "-l kor" param but it is detecting some
> Chinese characters. The image really does have 3 Chinese characters, and
> none of them is returned correctly (which I don't expect anyway), but other
> Korean characters are being recognized as Chinese characters.
>
> tesseract teste_kor.tif teste_kor -l kor --oem 3 --psm 6
>
> Any idea how to fix it?
>
>
>
> 
>
>
> Result:
>
>
> 1 화
>
>
> 서 05)
>
>
> 수 마 0 뜨 \) 에 사 로 잡혀 눈 을 도 저
>
> 히 뜰 수가 없다.
>
>
> 힘 을 내 도 겨우 반 개 하는 것이 고
>
> 작 . 그 이상 움직일 수가 없었다.
>
> " 아 ‥…. 7
>
>
> 苗 朮 習 趾 葉 刁 估 舍 點 選 們 同 對 刀
>
> 려 소 리 를 낸다. 하지만 신 음 에 가
>
> 까운 목 소 리 만 홀 러 나 올 뿐이었다.
>
> “장로 Q 全 程 ::: 가 시 면 ‥.”
>


Re: [tesseract-ocr] Install and run tesseract 4.0 on MAC OSX step by step

2018-04-08 Thread ShreeDevi Kumar

Thank you.

On Sun 8 Apr, 2018, 3:20 PM Fanatico,  wrote:

> I just posted in the repo issues a step-by-step guide to what I needed to
> do to use tesseract 4.0 on my Mac, so I'm sharing the link in case someone
> runs into the same problems I did.
> Note: it can save you a few days of your life.
>
> https://github.com/tesseract-ocr/tesseract/issues/1453
>


Re: [tesseract-ocr] Failed to build ScrollView.jar on MAC OSX

2018-04-07 Thread ShreeDevi Kumar

Please try from the main tesseract folder.



On Sat 7 Apr, 2018, 11:50 PM Fanatico,  wrote:

> from the java folder "cd ~/projects/tesseract/java" in my case
>
> On Saturday, 7 April 2018 12:40:29 UTC-3, shree wrote:
>>
>> Please see
>> https://github.com/tesseract-ocr/tesseract/blob/master/Makefile.am
>>
>> From which dir did you try
>>
>> make ScrollView.jar
>>
>> ShreeDevi
>> 
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>


Re: [tesseract-ocr] Failed to build ScrollView.jar on MAC OSX

2018-04-07 Thread ShreeDevi Kumar

Please see
https://github.com/tesseract-ocr/tesseract/blob/master/Makefile.am

From which dir did you try

make ScrollView.jar

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sat, Apr 7, 2018 at 7:42 PM, Fanatico  wrote:

> Hi. I finally got the 4.0 training to work, but I was unable to build
> ScrollView.jar, so I'm currently running the test with "--debug_interval
> -1". Can someone help me?
>
> System
>
> Platform: MAC OS X  10.13.3 (installed with brew)
> Tesseract: 4.0.0-beta.1
> leptonica: 1.75.3
>   libjpeg 9c : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11
>
> Steps
>
> *First Try*
>
> 1 - I cloned a new repository from git
> 2 - Added the "piccolo2d-core-3.0.jar" and "piccolo2d-extras-3.0.jar"
> files to the java folder
> 3 - Executed the command
> make ScrollView.jar
>
> 4 - Got this message from console:
> make: *** No rule to make target `ScrollView.jar'.  Stop.
>
> *Second Try*
>
> 1 - Downloaded the prebuilt files from:
> https://www.4shared.com/zip/FnP8RSu0/tess_debug_3_02.html
> 2 - Copied "piccolo2d-core-3.0.jar", "piccolo2d-extras-3.0.jar" and
> "ScrollView.jar" to the java folder
> 3 - Executed the command:
> SCROLLVIEW_PATH=~/projects/tesseract/java \
> /usr/local/Cellar/tesseract/HEAD-f8e26ee/bin/lstmtraining \
> --debug_interval 100 \
> --traineddata ~/tesstutorial/eng/eng.traineddata \
> --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
> --model_output ~/tesstutorial/engoutput/base \
> --learning_rate 20e-4 \
> --train_listfile ~/tesstutorial/engtrain/eng.training_files.txt \
> --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt \
> --max_iterations 5000 &>~/tesstutorial/engoutput/basetrain.log
>
> 4 - Got this error:
> Loaded file /Users/fernandogot/tesstutorial/engoutput/base_checkpoint, unpacking...
> Successfully restored trainer from /Users/fernandogot/tesstutorial/engoutput/base_checkpoint
> Loaded 72/72 pages (1-72) of document /Users/fernandogot/tesstutorial/eng/eng.Verdana.exp0.lstmf
> Loaded 72/72 pages (1-72) of document /Users/fernandogot/tesstutorial/eng/eng.Verdana.exp0.lstmf
> Starting sh -c "trap 'kill %1' 0 1 2 ; java -Xms1024m -Xmx2048m -jar /Users/fernandogot/projects/tesseract/java/ScrollView.jar & wait"
> Error: Unable to access jarfile /Users/fernandogot/projects/tesseract/java/ScrollView.jar
> sh: line 0: kill: %1: no such job
>
> Any idea on how to fix it?
>
> Thanks for reading and for your time!
>


Re: [tesseract-ocr] How to created training text as provided in langdata for any new language if i have just just have a wordlist.

2018-04-07 Thread ShreeDevi Kumar

see
https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-3.03%E2%80%933.05

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sat, Apr 7, 2018 at 4:02 PM, Romil Mehla  wrote:

> Thanks for your reply. I have read about tesseract 4.0, and Ray mentioned
> how he used many files to train it, but I don't want to use tesseract 4.0;
> I want to know about tesseract 3.05.00. From my understanding, for the eng
> language, say, the eng.training_text file is built from the eng.wordlist
> file in langdata. For a new language, how can I build training text from my
> new language's wordlist? Any idea who created the eng.training_text file?
> Is there a rule or algorithm for doing so, or is it randomly generated from
> eng.wordlist while keeping at least 10 occurrences of each character in the
> training text?
>
>
>
> Please clarify this and let me know how to generate the training_text.
>
> On Saturday, April 7, 2018 at 3:46:10 PM UTC+5:30, shree wrote:
>>
>> Just a word list is not enough for training text.
>>
>> For tesseract 4.0.0 it needs to be representative of the text to be
>> recognized.
>>
>> On Sat 7 Apr, 2018, 2:50 PM Romil Mehla,  wrote:
>>
>>> Is there any program to generate it? I see ambiguous_words.cpp
>>> generating dictionary words and ambiguous words. Where is it used? Or
>>> can it be used to build the unicharambigs file to generate rules?
>>>

Re: [tesseract-ocr] How to created training text as provided in langdata for any new language if i have just just have a wordlist.

2018-04-07 Thread ShreeDevi Kumar

Just a word list is not enough for training text.

For tesseract 4.0.0 it needs to be representative of the text to be
recognized.
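
As a rough illustration of why a word list alone falls short, here is a minimal sketch (my own assumption, not an official Tesseract tool) that reports which characters from a word list are missing or rare in a candidate training text. The function name and the threshold of 10, borrowed from the rule of thumb raised in this thread, are hypothetical.

```python
from collections import Counter


def underrepresented_chars(training_text, wordlist, min_count=10):
    """Return {char: count} for wordlist characters seen fewer than min_count times."""
    # Count every non-whitespace character in the candidate training text.
    counts = Counter(ch for ch in training_text if not ch.isspace())
    # Characters the language needs, collected from the wordlist entries.
    needed = {ch for word in wordlist for ch in word}
    # Counter returns 0 for characters that never occur in the text.
    return {ch: counts[ch] for ch in sorted(needed) if counts[ch] < min_count}
```

Passing this check only shows that each character occurs often enough. It does not make the text representative of real documents, which is the requirement stated above.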

On Sat 7 Apr, 2018, 2:50 PM Romil Mehla,  wrote:

> Is there any program to generate it? I see ambiguous_words.cpp
> generating dictionary words and ambiguous words. Where is it used? Or
> can it be used to build the unicharambigs file to generate rules?
>

Re: [tesseract-ocr] ERROR: exp0.box does not exist or is not readable

2018-04-07 Thread ShreeDevi Kumar

Look in your tmp directory, in the subfolders referred to in the console output.

Check the log file and the other files there.
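
A minimal sketch of that inspection, assuming you copy the temporary directory path from the console output yourself. The helper names are hypothetical, and treating `.log` files and zero-byte files as the first things to look at is my assumption.

```python
import os


def list_training_outputs(tmp_dir):
    """Return sorted (relative_path, size_in_bytes) pairs for every file under tmp_dir."""
    found = []
    for root, _dirs, files in os.walk(tmp_dir):
        for name in files:
            path = os.path.join(root, name)
            found.append((os.path.relpath(path, tmp_dir), os.path.getsize(path)))
    return sorted(found)


def suspicious(entries):
    """Keep log files and zero-byte files from a list_training_outputs() result."""
    return [(p, s) for p, s in entries if p.endswith(".log") or s == 0]
```

Calling `suspicious(list_training_outputs(path))` on the tmp folder named in the console output gives a short list of files to open first. A .box file that is absent from the full listing confirms that the rendering step itself failed.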

On Sat 7 Apr, 2018, 11:00 AM Fanatico,  wrote:

> Yes, the location is correct. I tried giving the full path to the folder
> and got the same error.
>
> I just cloned the https://github.com/tesseract-ocr/langdata repo
>
> On Friday, 6 April 2018 23:28:06 UTC-3, shree wrote:
>>
>> Is your langdata in the directory given by --langdata_dir ../../langdata?
>>
>>>
>>>

Re: [tesseract-ocr] ERROR: exp0.box does not exist or is not readable

2018-04-06 Thread ShreeDevi Kumar

Is your langdata in the directory given by --langdata_dir ../../langdata?

On Sat 7 Apr, 2018, 4:51 AM Fanatico,  wrote:

> I'm trying to run the training from the 4.0 tutorial, but I'm getting
> an error. Can someone help with this?
>
> Platform: MAC OS X 10.13.3
> Tesseract: 4.0.0-beta.1
> leptonica: 1.75.3
> libjpeg 9c : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11
>
>
> Code used
>
> ../../tesseract/training/tesstrain.sh \
>   --fonts_dir /Library/Fonts \
>   --lang eng --linedata_only \
>   --noextract_font_properties \
>   --exposures "0"\
>   --langdata_dir ../../langdata \
>   --tessdata_dir /usr/local/Cellar/tesseract/HEAD-f8e26ee/share/tessdata \
>   --fontlist "Verdana" \
>   --output_dir .~/tesstutorial/ara
>
> Result
>
> === Starting training for language 'eng'
> [Fri Apr 6 20:19:15 -03 2018] /usr/local/bin/text2image
> --fonts_dir=/Library/Fonts --font=Verdana
> --outputbase=/var/folders/xl/gqcd7ljn0k7d3r_3j9dy7x34gn/T/font_tmp.XX.aU9oTb7N/sample_text.txt
> --text=/var/folders/xl/gqcd7ljn0k7d3r_3j9dy7x34gn/T/font_tmp.XX.aU9oTb7N/sample_text.txt
> --fontconfig_tmpdir=/var/folders/xl/gqcd7ljn0k7d3r_3j9dy7x34gn/T/font_tmp.XX.aU9oTb7N
>
> === Phase I: Generating training images ===
> Rendering using Verdana
> [Fri Apr 6 20:19:17 -03 2018] /usr/local/bin/text2image
> --fontconfig_tmpdir=/var/folders/xl/gqcd7ljn0k7d3r_3j9dy7x34gn/T/font_tmp.XX.aU9oTb7N
> --fonts_dir=/Library/Fonts --strip_unrenderable_words --leading=32
> --char_spacing=0.0 --exposure=0
> --outputbase=/var/folders/xl/gqcd7ljn0k7d3r_3j9dy7x34gn/T/tmp.OaBuo1g2/eng/eng.Verdana.exp0
> --max_pages=3 --font=Verdana --text=../../langdata/eng/eng.training_text
> ERROR:
> /var/folders/xl/gqcd7ljn0k7d3r_3j9dy7x34gn/T/tmp.OaBuo1g2/eng/eng.Verdana.exp0.box
> does not exist or is not readable
> ERROR:
> /var/folders/xl/gqcd7ljn0k7d3r_3j9dy7x34gn/T/tmp.OaBuo1g2/eng/eng.Verdana.exp0.box
> does not exist or is not readable
>
> Observations
>
> I can find the font if I use:
>
> text2image --list_available_fonts --fonts_dir=/Library/Fonts
>
> I tested some other fonts.
>
> Thanks for the time and reply!
>

Re: [tesseract-ocr] Traineed non unicode font with tesseract

2018-04-06 Thread ShreeDevi Kumar

Please see
https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage

For Indian languages, use tesseract 4.0.0-beta.1
with the traineddata files from
https://github.com/tesseract-ocr/tessdata_fast

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Fri, Apr 6, 2018 at 12:04 PM, gopal bhalala <gopalbhal...@gmail.com>
wrote:

> Yes Shree. I am trying to recognize text from a PDF or image with a
> non-Unicode font. I tried to do that with make box but did not succeed.
> Can you please give me some guidance on how to do it?
>
> Best Regards & Thanking you,
> Gopal Dhanjibhai Bhalala
>
> On Fri, Apr 6, 2018 at 1:20 AM, ShreeDevi Kumar <shreesh...@gmail.com>
> wrote:
>
>> Are you trying to recognize the text from a pdf or image with non unicode
>> font?
>>
>> That is possible to do.
>>
>> If you want to train using non-unicode font, that is not possible.
>>
>> On Fri 6 Apr, 2018, 12:03 AM gopal bhalala, <gopalbhal...@gmail.com>
>> wrote:
>>
>>> Hi Shree,
>>>
>>> Thanks for the quick response. Is there any way to train with a non-Unicode
>>> font from a PDF and image?
>>> I have a non-Unicode PDF file and an image for OCR. Shall I box it and assign
>>> the Unicode characters? Is that the right way to OCR a non-Unicode PDF or
>>> image?
>>>
>>> On 05-Apr-2018 7:25 AM, "ShreeDevi Kumar" <shreesh...@gmail.com> wrote:
>>>
>>>> Training tesseract is only supported using unicode fonts.
>>>>
>>>> On Thu 5 Apr, 2018, 12:25 AM gopal bhalala, <gopalbhal...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi, I am new to tesseract-ocr. I want to train a non-Unicode font using
>>>>> tesseract. I tried to train it with jTessBoxEditor, but did not
>>>>> succeed.
>>>>>

Re: [tesseract-ocr] Traineed non unicode font with tesseract

2018-04-05 Thread ShreeDevi Kumar

Are you trying to recognize the text from a pdf or image with non unicode
font?

That is possible to do.

If you want to train using non-unicode font, that is not possible.

On Fri 6 Apr, 2018, 12:03 AM gopal bhalala, <gopalbhal...@gmail.com> wrote:

> Hi Shree,
>
> Thanks for the quick response. Is there any way to train with a non-Unicode font
> from a PDF and image?
> I have a non-Unicode PDF file and an image for OCR. Shall I box it and assign
> the Unicode characters? Is that the right way to OCR a non-Unicode PDF or
> image?
>
> On 05-Apr-2018 7:25 AM, "ShreeDevi Kumar" <shreesh...@gmail.com> wrote:
>
>> Training tesseract is only supported using unicode fonts.
>>
>> On Thu 5 Apr, 2018, 12:25 AM gopal bhalala, <gopalbhal...@gmail.com>
>> wrote:
>>
>>> Hi, I am new to tesseract-ocr. I want to train a non-Unicode font using
>>> tesseract. I tried to train it with jTessBoxEditor, but did not
>>> succeed.
>>>

Re: [tesseract-ocr] Traineed non unicode font with tesseract

2018-04-04 Thread ShreeDevi Kumar

Training tesseract is only supported using unicode fonts.

On Thu 5 Apr, 2018, 12:25 AM gopal bhalala,  wrote:

> Hi, I am new to tesseract-ocr. I want to train a non-Unicode font using
> tesseract. I tried to train it with jTessBoxEditor, but did not
> succeed.
>

Re: [tesseract-ocr] Error at training 4.0

2018-04-04 Thread ShreeDevi Kumar

Training Tesseract 4.0.0 is different from the process for 3.0x.

Training using images is not supported for Tesseract 4.0.0.

See https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00

On Thu 5 Apr, 2018, 1:36 AM Fanatico,  wrote:

> Hi, I'm new to tesseract and ocr in general, and need some help to train
> my tesseract.
>
> Config
> Platform: Mac OS X 10.13.3
> Tesseract Version: 4.0.0-beta.1
> leptonica: 1.75.3
>   libjpeg 9c : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11
>
> Images used:
>
> kor.AppleMyungjo.exp1.tif
> kor.AppleMyungjo.exp0.tif
>
> Step by step
> I'm trying to train (fine-tune) my tesseract to better detect commas (")
> and dots (.) in Korean, but I'm getting some errors. Here is what I have
> done so far:
>
> 1 - Got the images; I'm using 2 .tif images (both images have only 1 line
> and a few characters)
> 2 - Renamed the images to kor.AppleMyungjo.exp0.tif and
> kor.AppleMyungjo.exp1.tif
> 3 - Created the .box file for each image ```tesseract
> [language].[fontname].exp[samplenumber].tif
> [language].[fontname].exp[samplenumber] -l [language] batch.nochop
> makebox``` (one of them came out empty)
> 4 - Corrected the .box files using the site
> https://pp19dd.com/tesseract-ocr-chopper/ (I just pasted the positioning
> in the file)
> 5 - Created the .tr files for each image ```tesseract
> kor.AppleMyungjo.exp0.tif kor.AppleMyungjo.exp0 -l kor box.train ``` (both
> images got an empty .tr file)
> 6 - Created the unicharset file ```unicharset_extractor [box file 0] [box
> file 1]...```
> 7 - Created the font_properties file, which contains only ```AppleMyungjo 0 0 1 0 0```
> 8 - Cloned the tesseract repo to my mac, path ```~/projects/tesseract```
> 9 - Cloned the langdata repo to my mac, path ```~/projects/langdata```
> 10 - Found the folder where brew installed my tesseract, path
> ```/usr/local/Cellar/tesseract/HEAD-f8e26ee/share/tessdata```
> 11 - Executed the ```~/projects/tesseract/training/tesstrain.sh``` file
>
>
> ```
> sudo ~/projects/tesseract/training/tesstrain.sh \
>   --fonts_dir /Library/Fonts  \
>   --lang kor \
>   --linedata_only  \
>   --noextract_font_properties  \
>   --exposures "0"\
>   --langdata_dir ~/projects/langdata \
>   --tessdata_dir /usr/local/Cellar/tesseract/HEAD-f8e26ee/share/tessdata \
>   --output_dir ~/tesstutorial/kor \
>   --fontlist "AppleMyungjo"
> ```
> and got the error:
> ```
> === Starting training for language 'kor'
> mktemp: illegal option -- -
> usage: mktemp [-d] [-q] [-t prefix] [-u] template ...
>        mktemp [-d] [-q] [-u] -t prefix
> [Wed Apr 4 13:26:24 -03 2018] /usr/local/bin/text2image
> --fonts_dir=/Library/Fonts --font=AppleMyungjo
> --outputbase=/sample_text.txt --text=/sample_text.txt --fontconfig_tmpdir=
> Fontconfig error: Cannot load default config file
>
> === Phase I: Generating training images ===
> Rendering using AppleMyungjo
> [Wed Apr 4 13:26:25 -03 2018] /usr/local/bin/text2image
> --fontconfig_tmpdir= --fonts_dir=/Library/Fonts --strip_unrenderable_words
> --leading=32 --char_spacing=0.0 --exposure=0
> --outputbase=/var/folders/zz/zyxvpxvq6csfxvn_n0/T/tmp.d1OKhvnG/kor/kor.AppleMyungjo.exp0
> --max_pages=3 --font=AppleMyungjo
> --text=/Users/fernandogot/projects/langdata/kor/kor.training_text
> Fontconfig error: Cannot load default config file
> ERROR:
> /var/folders/zz/zyxvpxvq6csfxvn_n0/T/tmp.d1OKhvnG/kor/kor.AppleMyungjo.exp0.box
> does not exist or is not readable
> ERROR:
> /var/folders/zz/zyxvpxvq6csfxvn_n0/T/tmp.d1OKhvnG/kor/kor.AppleMyungjo.exp0.box
> does not exist or is not readable
> ```
>
> I found that the ```Fontconfig error: Cannot load default config file```
> was being caused by mktemp on the Mac. I fixed it by replacing this
> code:
>
> training/tesstrain_utils.sh
> ```diff
> - export FONT_CONFIG_CACHE=$(mktemp -d --tmpdir font_tmp.XX)
> + export FONT_CONFIG_CACHE=$(mktemp -d -t font_tmp.XX)
> ```
> After executing the same code I get:
>
> ```
> === Starting training for language 'kor'
> [Wed Apr 4 14:13:38 -03 2018] /usr/local/bin/text2image
> --fonts_dir=/Library/Fonts --font=AppleMyungjo
> --outputbase=/var/folders/zz/zyxvpxvq6csfxvn_n0/T/font_tmp.XX.X52wexDs/sample_text.txt
> --text=/var/folders/zz/zyxvpxvq6csfxvn_n0/T/font_tmp.XX.X52wexDs/sample_text.txt
> --fontconfig_tmpdir=/var/folders/zz/zyxvpxvq6csfxvn_n0/T/font_tmp.XX.X52wexDs
>
> === Phase I: Generating training images ===
> Rendering using AppleMyungjo
> [Wed Apr 4 14:13:40 -03 2018] /usr/local/bin/text2image
>

Re: [tesseract-ocr] Checkbox Extraction as text after Fine tuning for new characters .

2018-04-03 Thread ShreeDevi Kumar

Try to train with a large number of fonts and see if that improves the
result.

On Tue 3 Apr, 2018, 2:29 PM Apoorv Khanna,  wrote:

> Hi all,
>
> I am able to extract a few check boxes after fine-tuning the English model,
> but tesseract is not able to extract all of them.
>
> Thanks in advance
>
> Version used: *tesseract 4 beta*
> Font used for training: *Dejavu Sans*
> Number of symbols inserted in the training text: 14 each
>
> *Extracted text:*
> ☐not reported wnot reported zpnot reported
> cno Byes tno ☒yes ☐no ☑pyes
> not reported not reported ☐not reported
>

Re: [tesseract-ocr] does it make sense to train existing languages? how to fix repeatedly wrong letters?

2018-04-02 Thread ShreeDevi Kumar

My suggestion would be to do post processing of the OCR output.
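
A minimal sketch of such post-processing, assuming the confusions listed in the quoted message below. The rule table and function name are hypothetical, and each pattern is a guess about the page layout that should be checked against real output before use.

```python
import re

# Hypothetical correction rules; every pattern is an assumption about the layout.
RULES = [
    (re.compile(r"^00(?=\s)", re.MULTILINE), "oo"),  # wedding marker at line start
    (re.compile(r"(?<!\w)Ill(?!\w)"), "III"),        # Roman numeral 3 read as I + ll
    (re.compile(r"(?<!\w)ll(?!\w)"), "II"),          # Roman numeral 2 read as ll
    (re.compile(r"%"), "*/~"),                       # "born about" marker read as %
]


def postprocess(text):
    """Apply each correction rule in order to the OCR output."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text
```

The word-boundary checks keep the rules from touching "ll" inside ordinary words. The "%" rule is unconditional and would also rewrite a genuine percent sign, so it only suits pages where "%" never legitimately occurs. The "~ read as -" confusion is left out because a blanket replacement would also change genuine hyphens.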

On Mon 2 Apr, 2018, 6:09 PM JP T,  wrote:

> Hi
>
> I don't really understand the consequences of training.
>
> My problem:
> I've got tons of pages with a special format ("one place study" about the
> historic inhabitants of a town).
>
> Tesseract repeatedly fails on a few special words:
> oo (oh-oh) at the start of a line for "wedding" is often interpreted as 00 (zero
> zero)
> Roman numerals 2 and 3 in Arial font are taken for lowercase LL or
> uppercase I plus lowercase LL
> */~ (birth at about) is read as percent %
> ~ is read as -
>
> My scans are of almost perfect quality (I used Fred's scripts), so there is
> nothing more I can do on that side.
> Adding oo to user words did not help.
>
> Can I use training to solve these, or should I instead write a script that
> fixes the mistakes after OCR?
> The problem is that OCR needs to know some semantics. The Arial letters
> themselves hardly provide a hint as to which one is correct.
>
> Thanks
>
>

Re: [tesseract-ocr] Extracting pristine rasterized text

2018-04-02 Thread ShreeDevi Kumar

Thank you for the detailed info.

My suggestion is to try recognition with eng.traineddata from the
tessdata_fast repository with --oem 1.
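
A minimal sketch of that invocation, assuming the tessdata_fast files were downloaded into a directory of your choice. The function names and all paths are hypothetical; the `--tessdata-dir`, `-l` and `--oem` options are the standard tesseract 4 command-line options.

```python
import subprocess


def build_tesseract_command(image, output_base, tessdata_dir, lang="eng", oem=1):
    """Build the argument list for a tesseract run with an explicit engine mode."""
    return [
        "tesseract", image, output_base,
        "--tessdata-dir", tessdata_dir,  # directory holding <lang>.traineddata
        "-l", lang,
        "--oem", str(oem),               # 1 selects the LSTM engine only
    ]


def run_ocr(image, output_base, tessdata_dir):
    """Run tesseract; requires the tesseract binary on PATH."""
    subprocess.run(build_tesseract_command(image, output_base, tessdata_dir), check=True)
```

Timing `run_ocr` on the same page with the tessdata and tessdata_fast directories gives a direct comparison of the two models.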


On Tue 3 Apr, 2018, 3:13 AM Patrick Ramsey, 
wrote:

> Answers below inline. And thank you very much for your help :)
>
> |PTR
>
> On Friday, March 30, 2018 at 2:00:18 AM UTC-7, shree wrote:
>>
>> Please check GitHub/issues for similar reports and suggestions.
>>
>> Also specify,
>>
> Which version/commit of tesseract 4
>>
>
> commit hash: 40f43111e05b3dd2f2f8aeae3aba33016523c881
> tag: 4.0.0-beta.1
>
> Which traineddata file, from which repo
>>
>
> eng.traineddata from https://github.com/tesseract-ocr/tessdata at commit
> 9b2e3f6642285b3e9a7a5852e5b10259e42d5510
>
>
>> Which o/s
>>
>
> Ubuntu 17.10 on amd64
>
>>
>> tesseract -v
>>
>
> tesseract 4.0.0-beta.1
>  leptonica-1.74.4
>   libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.1) : libpng 1.6.34 :
> libtiff 4.0.8 : zlib 1.2.11 : libwebp 0.6.0 : libopenjp2 2.2.0
>
>  Found AVX2
>  Found AVX
>  Found SSE
>
>>
>> On Fri 30 Mar, 2018, 2:19 PM Patrick Ramsey, 
>> wrote:
>>
>>> Hi!
>>>
>>> So, I am running tesseract4 on clean, 1-bit images of rasterized text
>>> (not printed and scanned).  I'm getting very accurate output, as expected,
>>> but tesseract is taking about 1 second to process a single page on a core
>>> i7 cpu, and that seems a lot longer than I'd have expected.
>>>
>>> I've been trying to enable debug output so that I can see what's taking
>>> the most time, to see if there is anything that I could get away with
>>> turning off to speed it up (since I don't need to account for e.g. dirt on
>>> the lens), but thus far I'm feeling pretty stupid.  So:
>>>
>>> A) is there any straightforward way to get more information on what
>>> tesseract is actually doing? (I've built with --enable-debug and it doesn't
>>> seem to have changed the output on the command line)
>>> B) are there any control parameters you folks would suggest setting to
>>> speed up image processing/turn off unnecessary work, given the inputs I've
>>> described?
>>>
>>> Many thanks,
>>>
>>> PTR
>>>

Re: [tesseract-ocr] Any suggestions for more accurate Text conversion?

2018-03-27 Thread ShreeDevi Kumar

Version mismatch: that traineddata is for Tesseract 4.0, and you are running
3.05.01.

The wiki has pages for training. Look for the one appropriate for your version
of Tesseract.
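
Separately from the version mismatch, it may be worth checking the language name: the error in the quoted output refers to end-numCAPS.traineddata, while the directory listing shows eng-numCAPS.traineddata. A minimal sketch of such a check follows; the helper name is hypothetical and the similarity cutoff is an arbitrary assumption.

```python
import difflib


def check_language(lang, tessdata_files):
    """Return (found, suggestions) for a -l value against a tessdata listing."""
    suffix = ".traineddata"
    # Strip the suffix to get the language names tesseract would accept.
    available = [f[: -len(suffix)] for f in tessdata_files if f.endswith(suffix)]
    if lang in available:
        return True, []
    # Offer near matches so that a one-letter typo is easy to spot.
    return False, difflib.get_close_matches(lang, available, n=3, cutoff=0.6)
```

Feeding it the output of `ls` on the tessdata directory and the value passed to `-l` shows whether the name exists and, if not, which installed name is closest.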

On Wed 28 Mar, 2018, 1:23 AM ,  wrote:

> Hi Shree,
>
> I just tried using the training data file you provided but it seems that
> there is some problem with Tesseract recognizing this file. I should have
> mentioned before that I am using version '3.05.01'.
>
> Below is the sequence of commands I ran:
>
> Bhargavs-MacBook-Pro-2:LPR bhargav$ tesseract topcrop1.jpg out -l end-numCAPS
> Error opening data file /usr/local/Cellar/tesseract/3.05.01/share/tessdata/end-numCAPS.traineddata
> Please make sure the TESSDATA_PREFIX environment variable is set to the
> parent directory of your "tessdata" directory.
> Failed loading language 'end-numCAPS'
> Tesseract couldn't load any languages!
> Could not initialize tesseract.
>
> Bhargavs-MacBook-Pro-2:LPR bhargav$ ls /usr/local/Cellar/tesseract/3.05.01/share/tessdata/
> configs eng.traineddata pdf.ttf
> eng-numCAPS.traineddata osd.traineddata tessconfigs
>
> Bhargavs-MacBook-Pro-2:LPR bhargav$ echo $TESSDATA_PREFIX
> /usr/local/share/tessdata
>
> Please let me know whether I have done something wrong, or whether the
> traineddata file has a version mismatch or is corrupted.
>
> Thanks,
> Bhargav
>
> On Tuesday, March 27, 2018 at 11:24:36 AM UTC-7, bha...@automot.us wrote:
>>
>> Thank you, Shree. I will give it a shot with the attached traineddata!
>>
>> About fine-tuning, are there any example tutorials on the Tesseract wiki?
>> I am not sure. I will try to find one, but if you know of one and can post
>> the link, I would really appreciate that!
>>
>> Thanks.
>>
>> On Tuesday, March 27, 2018 at 3:00:06 AM UTC-7, shree wrote:
>>>
>>> You can try finetune training.
>>>
>>> Test with attached traineddata file.
>>>

Re: [tesseract-ocr] Unable to use tesseract api installed with a nuget pkg

2018-03-27 Thread ShreeDevi Kumar

I don't use Visual Studio. However, I know that we support Visual Studio
installation via cppan and CMake. Please follow those directions.

On Tue 27 Mar, 2018, 9:24 PM sonu sainju,  wrote:

> Hey Shree, thanks for replying. No, I didn't build using cppan and cmake. I
> used the vcpkg install command. Isn't vcpkg supposed to acquire and install
> everything?
>
> On Sunday, March 25, 2018 at 5:46:06 PM UTC-7, shree wrote:
>>
>> Did you build using cppan and cmake?
>>
>> On Mon 26 Mar, 2018, 1:50 AM sonu sainju,  wrote:
>>
>>> Hi,
>>>
>>> I followed the instructions at
>>> https://github.com/tesseract-ocr/tesseract/wiki/Compiling#windows
>>> to build tesseract and use it in a VS2015 project. After installing
>>> tesseract via vcpkg, I exported it as a NuGet package and added it to my
>>> project like any other NuGet package, but I am getting link errors like:
>>>
>>> Severity Code Description Project File Line Suppression State
>>> Error LNK2001 unresolved external symbol closesocket Project1 c:\Users\sonu's\documents\visual studio 2015\Projects\Project1\Project1\tesseract305.lib(svutil.cpp.obj) 1
>>> Error LNK2001 unresolved external symbol connect Project1 c:\Users\sonu's\documents\visual studio 2015\Projects\Project1\Project1\tesseract305.lib(svutil.cpp.obj) 1
>>> Error LNK2001 unresolved external symbol htons Project1 c:\Users\sonu's\documents\visual studio 2015\Projects\Project1\Project1\tesseract305.lib(svutil.cpp.obj) 1
>>> ...
>>>
>>> Is there something I have missed? Has anybody tried using the tesseract API
>>> this way in VS 2015?
>>>

Re: [tesseract-ocr] How to merge 2 traineddata into 1 traineddata

2018-03-26 Thread ShreeDevi Kumar

Please look at
https://github.com/tesseract-ocr/tessdata_fast/tree/master/script

Look at all the Han* files there.

Hangul may be the one you need.

See https://github.com/tesseract-ocr/tessdata_fast/blob/master/README.md
for more details
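
As a rough illustration of using one script-level file instead of a `+`
combination such as `-l eng+chi_tra`, the sketch below only builds the command
line; it does not run Tesseract. It assumes Tesseract 4 and a local copy of
tessdata_fast, so that `script/Hangul.traineddata` sits under the directory
passed to `--tessdata-dir`. `build_script_command` and the paths are
hypothetical, and whether Hangul covers the Korean + English mix you need is
something to check on your own images.

```python
import shlex


def build_script_command(image, output_base, tessdata_dir, script="Hangul"):
    """Build a tesseract command line that uses a single script-level model."""
    return [
        "tesseract", image, output_base,
        "--tessdata-dir", tessdata_dir,  # directory that contains script/
        "-l", f"script/{script}",        # one traineddata, no '+' combination
    ]


# Placeholder paths; replace them with your own image and tessdata location.
cmd = build_script_command("page.png", "out", "/path/to/tessdata_fast")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the
paths point at real files.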

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Mon, Mar 26, 2018 at 2:57 PM,  wrote:

> Thank you for answering my question.
>
> However, I made a mistake in what I wrote.
>
> I want Korean + English.
>
> Is the way to merge the traineddata the same?
>
> On Monday, March 26, 2018 at 6:15:00 PM UTC+9, shree wrote:
>>
>> Try the script level traineddata files from tessdata_fast/script
>>
>> Han probably has eng+chi*
>>
>> ShreeDevi
>> 
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Mon, Mar 26, 2018 at 12:01 PM,  wrote:
>>
>>> Hi, I'm a newbie. I'm interested in tesseract 4.00 _beta.1
>>>
>>> I have a question.
>>>
>>> How do I merge 2 traineddata files into 1 traineddata file?
>>>
>>> I don't want to use a command line option like -l eng+chi_tra
>>>
>>> Thank you
>>>

Re: [tesseract-ocr] How to merge 2 traineddata into 1 traineddata

2018-03-26 Thread ShreeDevi Kumar

Try the script level traineddata files from tessdata_fast/script

Han probably has eng+chi*

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Mon, Mar 26, 2018 at 12:01 PM,  wrote:

> Hi, I'm a newbie. I'm interested in tesseract 4.00 _beta.1
>
> I have a question.
>
> How do I merge 2 traineddata files into 1 traineddata file?
>
> I don't want to use a command line option like -l eng+chi_tra
>
> Thank you
>
