Ryan,

I copied text covering the extended character range from Wikipedia and
similar sources to create a quick training set. It is recommended to train
with 'actual' text, since I think Tesseract relies on language model data.
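
One rough way to check whether a quickly assembled training text actually
covers the characters of interest (for example the accented letters and the
ff, ffi, ffl ligatures) is to count them before training. The snippet below
is only a sketch of that idea, not part of Tesseract or of the training
scripts: the function name coverage_report and the REQUIRED set (letters in
U+00C0..U+00FF plus three ligatures) are assumptions made for illustration.

```python
import unicodedata
from collections import Counter

# Hypothetical target set (an assumption, adjust to your own range):
# the letters in U+00C0..U+00FF plus the ff, ffi and ffl ligatures.
REQUIRED = {
    chr(cp)
    for cp in range(0x00C0, 0x0100)
    if unicodedata.category(chr(cp)).startswith("L")
}
REQUIRED |= {"\ufb00", "\ufb03", "\ufb04"}


def coverage_report(training_text, required_chars):
    """Return (missing, counts) for a training text.

    missing: sorted list of required characters that never occur.
    counts:  Counter of every non-whitespace character in the text.
    """
    counts = Counter(ch for ch in training_text if not ch.isspace())
    missing = sorted(ch for ch in required_chars if counts[ch] == 0)
    return missing, counts


if __name__ == "__main__":
    sample = "Caf\u00e9 na\u00efve o\ufb03ce"
    missing, counts = coverage_report(sample, REQUIRED)
    print("characters seen:", len(counts))
    print("required but missing:", " ".join(missing))
```

Characters reported as missing would need more sample text (and a font that
contains them) before they can appear in the resulting unicharset.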

Please see the tutorial on tesseract from
https://drive.google.com/folderview?id=0B7l10Bj_LprhQnpSRkpGMGV2eE0&usp=sharing
for more background.


ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sat, Nov 22, 2014 at 3:20 AM, Ryan <[email protected]>
wrote:

> Great, thank you for the additional information.
>
> On Wed, Nov 19, 2014 at 7:47 PM, ShreeDevi Kumar <[email protected]>
> wrote:
>
>> Training 2 files
>>
>> ShreeDevi
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Thu, Nov 20, 2014 at 9:15 AM, ShreeDevi Kumar <[email protected]>
>> wrote:
>>
>>> Training 1 files
>>>
>>> ShreeDevi
>>> ____________________________________________________________
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>
>>> On Thu, Nov 20, 2014 at 8:54 AM, ShreeDevi Kumar <[email protected]>
>>> wrote:
>>>
>>>> Hi Ryan,
>>>>
>>>> Attached are a couple of training logs and their unicharsets; these
>>>> have details of the fonts used (they are from 2 different trainings). I
>>>> tried to use fonts that support the full range, created box/tiff pairs
>>>> using jTessBoxEditor, and did the rest of the training using a modified
>>>> tesstrain.sh.
>>>>
>>>> Most of the fonts used are those available on Windows.
>>>>
>>>> Additionally, I am using the development version of FreeSerif (from GNU
>>>> freefont project - https://www.gnu.org/software/freefont/).
>>>>
>>>> I also used Siddhanta (which I use mainly for Sanskrit but which has
>>>> support for the accented letters too); you can download it from
>>>> http://www.svayambhava.org/
>>>>
>>>> I can send you the box/tiff pairs that I used, in case you want them,
>>>> in addition to your own training images.
>>>>
>>>> ShreeDevi
>>>> ____________________________________________________________
>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>
>>>> On Thu, Nov 20, 2014 at 5:07 AM, Ryan Dev <
>>>> [email protected]> wrote:
>>>>
>>>>> I'm dealing with font subsets, and I generate an image per font, so
>>>>> there is no reading order, though I've seen Latin and CJK in the same
>>>>> font subset. If OSD just gives reading, orientation, and text order, it
>>>>> is not going to give me anything useful. Plus I have the font, so I
>>>>> could get some of that info from the font, just with no idea what
>>>>> language (though maybe I should go back and take another look...).
>>>>>
>>>>> I've got training up and running on Ubuntu. I modified the text file
>>>>> you gave me, just adding some missing ligatures (ff, ffi, ffl), but my
>>>>> asc.traineddata is way worse than yours.
>>>>>
>>>>> *Do you have a list of fonts you used to create asc.traineddata that I
>>>>> could start with*? For example, I think my fonts are missing the old
>>>>> ASCII drawing blocks that you include, which work great on the fonts
>>>>> that use them (usually for bullets).
>>>>>
>>>>>
>>>>>  --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to [email protected].
>>>>> To post to this group, send email to [email protected].
>>>>> Visit this group at http://groups.google.com/group/tesseract-ocr.
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/9f084bab-80b2-4c3b-9de8-9add618a8484%40googlegroups.com
>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/9f084bab-80b2-4c3b-9de8-9add618a8484%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>> .
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>
>>>>
>>>>
>>>
>>
>
