Re: [tesseract-ocr] Spanish text better processed in eng than in spa

2017-08-29 Thread ShreeDevi Kumar
I have opened this as an issue at https://github.com/tesseract-ocr/tessdata/issues/77 You can provide additional feedback there. @theraysmith is doing the training at Google. The examples you provide will be helpful to him and improve future training. ShreeDevi

Re: [tesseract-ocr] Spanish text better processed in eng than in spa

2017-08-29 Thread valentin . depablo
The spa and Latin models in the best folder are more or less equivalent; there is no significant difference, and although both make several errors, the errors are quite reasonable. The ones that produce really bad output are the official ones that are installed automatically. Do you need help training the data? (is a

Re: [tesseract-ocr] Spanish text better processed in eng than in spa

2017-08-28 Thread ShreeDevi Kumar
>Btw, is there any way to tell tesseract that values are in a table, so that it will not make a mistake identifying lines with charts? I don't think tesseract has that ability. You will need to preprocess the image to remove lines. Leptonica has functions to do that, as well as a table detector.
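
To illustrate the kind of preprocessing meant here, below is a minimal sketch in plain Python. It is not Leptonica's method and does not use Leptonica's API; it is a simple run-length heuristic, assuming the image has already been binarized into rows of 0/1 values (1 = ink). The function name remove_long_runs and the min_len threshold are hypothetical and chosen only for this example.

```python
def remove_long_runs(img, min_len):
    """Return a copy of a binary image (list of rows, 1 = ink) with every
    horizontal or vertical ink run of at least min_len pixels cleared.
    Long straight runs are assumed to be table rules, not text strokes."""
    h = len(img)
    w = len(img[0]) if h else 0
    out = [row[:] for row in img]  # work on a copy, leave the input intact

    def runs(seq):
        # Yield (start, end) index pairs for each consecutive run of ink.
        start = None
        for i, v in enumerate(list(seq) + [0]):  # sentinel closes a final run
            if v and start is None:
                start = i
            elif not v and start is not None:
                yield start, i
                start = None

    # Clear long horizontal runs.
    for y in range(h):
        for s, e in runs(img[y]):
            if e - s >= min_len:
                for x in range(s, e):
                    out[y][x] = 0

    # Clear long vertical runs.
    for x in range(w):
        column = [img[y][x] for y in range(h)]
        for s, e in runs(column):
            if e - s >= min_len:
                for y in range(s, e):
                    out[y][x] = 0

    return out
```

Note that this also erases ink where a character touches or crosses a rule, so min_len should be well above the widest stroke expected in the text. A real pipeline would more likely use morphological operations from an image library.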

Re: [tesseract-ocr] Spanish text better processed in eng than in spa

2017-08-28 Thread ShreeDevi Kumar
I had not checked the list. It should actually be Latin.traineddata for all languages written in Latin script, not Spanish as I had written. On 29-Aug-2017 3:54 AM, wrote: > So... I have installed the default tessdata used by the installer, which > seems to be this

Re: [tesseract-ocr] Spanish text better processed in eng than in spa

2017-08-28 Thread valentin . depablo
So... I have installed the default tessdata used by the installer, which seems to be this one: https://github.com/tesseract-ocr/tessdata/blob/master/spa.traineddata Following your comment, I have installed the package: https://github.com/tesseract-ocr/tessdata/blob/master/best/spa.traineddata

Re: [tesseract-ocr] Spanish text better processed in eng than in spa

2017-08-28 Thread ShreeDevi Kumar
Have you tried with the 'best' traineddata files? What about results using best/Spanish vs best/spa? I have opened this as an issue at https://github.com/tesseract-ocr/tessdata/issues/77 You can provide additional feedback there. ShreeDevi
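
One way to run such a comparison is sketched below, assuming the 'best' files have been downloaded into a local directory. The path tessdata/best and the helper name tesseract_cmd are hypothetical; tessinput.tif is the file name mentioned in this thread. The sketch only builds and prints the command lines, using the tesseract command-line options -l (language) and --tessdata-dir (directory holding the .traineddata files); it does not run them.

```python
def tesseract_cmd(image, outbase, lang, tessdata_dir=None):
    """Build a tesseract command line as a list of arguments.
    tessdata_dir, if given, points at the directory that holds the
    .traineddata files to use (hypothetical local path in this example)."""
    cmd = ["tesseract", image, outbase, "-l", lang]
    if tessdata_dir:
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


# Print one command per language model so the outputs can be compared.
for lang in ("eng", "spa"):
    print(" ".join(tesseract_cmd("tessinput.tif", "out_" + lang, lang, "tessdata/best")))
```

Each command writes its text to a separate output base (out_eng, out_spa), so the two results can be diffed against each other. To actually execute a command, the list can be passed to subprocess.run.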

[tesseract-ocr] Spanish text better processed in eng than in spa

2017-08-27 Thread valentin . depablo
So... after following the instructions from the quality improvement page: https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality I found what I think is a nice picture. I attach the tessinput.tif file I received as output. When I ran tesseract 4.0.0 on the image I found that actually the eng