Ryan, I copied text covering the extended range from Wikipedia and similar sources to create a quick training set. It is recommended to train with 'actual' text, since I think Tesseract relies on language model data.
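As a rough illustration of checking a training text before training, here is a minimal sketch in plain Python. It is not part of Tesseract; the function names `char_coverage` and `missing_chars`, the sample sentence, and the list of required characters are all hypothetical and only show the idea of confirming that the extended characters (for example the ff, ffi, ffl ligatures) really occur in the text.

```python
import unicodedata
from collections import Counter


def char_coverage(text, required):
    """Count how often each required character occurs in the training text."""
    counts = Counter(text)
    return {ch: counts.get(ch, 0) for ch in required}


def missing_chars(text, required):
    """Return the required characters that never occur, in the order given."""
    seen = set(text)
    return [ch for ch in required if ch not in seen]


if __name__ == "__main__":
    # Hypothetical sample; in practice, read the real training text file instead.
    sample = "The office staff waffled over the affidavit. Caf\u00e9 na\u00efve."
    # Assumed targets: ligatures ff, ffi, ffl (U+FB00, U+FB03, U+FB04)
    # plus two accented letters.
    required = ["\ufb00", "\ufb03", "\ufb04", "\u00e9", "\u00ef"]
    for ch in missing_chars(sample, required):
        print(f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')} is missing")
```

Characters reported as missing would need real sentences that contain them added to the training text; a plain "ffi" letter sequence does not count as the ligature code point.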
Please see the tutorial on Tesseract at https://drive.google.com/folderview?id=0B7l10Bj_LprhQnpSRkpGMGV2eE0&usp=sharing for more background.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sat, Nov 22, 2014 at 3:20 AM, Ryan <[email protected]> wrote:

> Great, thank you for the additional information.
>
> On Wed, Nov 19, 2014 at 7:47 PM, ShreeDevi Kumar <[email protected]>
> wrote:
>
>> Training 2 files
>>
>> ShreeDevi
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Thu, Nov 20, 2014 at 9:15 AM, ShreeDevi Kumar <[email protected]>
>> wrote:
>>
>>> Training 1 files
>>>
>>> ShreeDevi
>>> ____________________________________________________________
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>
>>> On Thu, Nov 20, 2014 at 8:54 AM, ShreeDevi Kumar <[email protected]>
>>> wrote:
>>>
>>>> Hi Ryan,
>>>>
>>>> Attached are a couple of training logs and their unicharsets; these
>>>> have details of the fonts used (they are from 2 different trainings).
>>>> I tried to use fonts that support the full range, created box/tiff
>>>> pairs using jTessBoxEditor, and did the rest of the training using a
>>>> modified tesstrain.sh.
>>>>
>>>> Most of the fonts used are what's available on Windows.
>>>>
>>>> Additionally, I am using the development version of FreeSerif (from
>>>> the GNU freefont project - https://www.gnu.org/software/freefont/).
>>>>
>>>> I also used Siddhanta (which I use mainly for Sanskrit but which has
>>>> support for the accented letters too); you can download that from
>>>> http://www.svayambhava.org/
>>>>
>>>> I can send you the box/tiff pairs that I used, in case you want them,
>>>> in addition to your own training images.
>>>>
>>>> ShreeDevi
>>>> ____________________________________________________________
>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>
>>>> On Thu, Nov 20, 2014 at 5:07 AM, Ryan Dev <[email protected]> wrote:
>>>>
>>>>> I'm dealing with font subsets, and I generate an image per font, so
>>>>> there is no reading order. Though I've seen Latin and CJK in the
>>>>> same font subset. If OSD just gives reading, orientation, and text
>>>>> order, it is not going to give me anything useful. Plus I have the
>>>>> font, so I could get some of that info from the font, just no idea
>>>>> what language (though maybe I should go back and take another
>>>>> look...).
>>>>>
>>>>> I've got training up and running, on Ubuntu. I modified the text
>>>>> file you gave me, just adding some missing ligatures (ff, ffi, ffl),
>>>>> but my asc.traineddata is way worse than yours.
>>>>>
>>>>> *Do you have a list of fonts you used to create asc.traineddata that
>>>>> I could start with*? For example, I think my fonts are missing the
>>>>> old ASCII drawing blocks that you include, and which work great on
>>>>> the fonts that use those (for bullets usually).
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>> send an email to [email protected].
>>>>> To post to this group, send email to [email protected].
>>>>> Visit this group at http://groups.google.com/group/tesseract-ocr.
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/9f084bab-80b2-4c3b-9de8-9add618a8484%40googlegroups.com
>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/9f084bab-80b2-4c3b-9de8-9add618a8484%40googlegroups.com?utm_medium=email&utm_source=footer>.
>>>>> For more options, visit https://groups.google.com/d/optout.

