Hi Ryan,

Thanks very much for such a useful answer! I'm building your Docker container as I type, and I'll try font training once it's built.

I tried training with boxes and images, but it complained about a good number of my boxes, saying it couldn't detect blobs within them. I'm guessing my problem is that my characters aren't well separated, so I plan to look at whether I can just remove those boxes or edit the images to remove some characters.

Phil

On Wednesday, 1 April 2015 17:59:35 UTC+1, Ryan Baumann wrote:
> Also, to answer your other questions:
>
> - There appear to be some other issues with Pango/Cairo rendering under OS X which may impact the training process. As a result, and for general replicability, I now use a Dockerized Linux environment to do Tesseract training on my Mac: https://github.com/ryanfb/tesseract_latinocr_docker
> - Training from fonts works surprisingly well, but if there are significant artifacts introduced by your pipeline/capture process, you may get better accuracy with a manual box/train against images.
>
> -Ryan
>
> On Tuesday, March 31, 2015 at 3:43:23 PM UTC-4, Philip Pearl wrote:
>> Hi All
>>
>> I'm trying to train tesseract for the first time on my Mac. I'm running text2image as follows, but it is crashing in Pango as the priv data on the font is NULL.
>>
>> /usr/local/Cellar/tesseract/HEAD/bin//text2image --leading=32 --fonts_dir=/Library/Fonts --box_padding=0 --strip_unrenderable_words --char_spacing=0.0 --exposure=0 --find_fonts=true --outputbase=/tmp/tesstrain/eng/eng.Helvetica_Neue_Thin.exp0 --text=./tesslang/eng/eng.training_text
>>
>> Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
>>
>> 0 libpangoft2-1.0.0.dylib 0x00000001090fad9e pango_fc_font_get_glyph + 25
>> 1 text2image 0x000000010858bf58 tesseract::PangoFontInfo::CanRenderString(char const*, int, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > >*) const + 322
>> 2 text2image 0x000000010858d0ab tesseract::FontUtils::SelectFont(char const*, int, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >*, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > >*) + 287
>> 3 text2image 0x0000000108592c06 tesseract::StringRenderer::RenderAllFontsToImage(double, char const*, int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >*, Pix**) + 108
>> 4 text2image 0x0000000108584149 main + 2750
>> 5 libdyld.dylib 0x00007fff932315fd start + 1
>>
>> I installed from HEAD using homebrew and the instructions I found here:
>> https://ryanfb.github.io/etc/2014/11/19/installing_tesseract_training_tools_on_mac_os_x.html
>>
>> - Any ideas how to get around this crash?
>> - Am I crazy running this on my Mac? Would I be better off with a Linux VM?
>> - Does training from fonts work, or am I better off starting with images (my data is analog HD screen captures of TV menus!)? I know the font the menus use.
>>
>> Thanks in advance for any help or advice you are able to give me.
>>
>> Phil
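
A note on the box-removal idea raised at the top of this thread: a Tesseract box file has one line per character, in the form "symbol left bottom right top page", so dropping the boxes the trainer rejects amounts to deleting those lines. What follows is only a minimal Python sketch under stated assumptions, not anyone's tested workflow. The function name drop_boxes is hypothetical, the rejected line numbers are assumed to be collected by hand from the training output, and the sample coordinates are made up.

```python
def drop_boxes(box_text, bad_lines):
    """Return box-file text with the given 1-based line numbers removed.

    box_text  -- full contents of a .box file as a string
    bad_lines -- set of 1-based line numbers to drop (assumed to be
                 gathered manually from the trainer's complaints)
    """
    kept = []
    for number, line in enumerate(box_text.splitlines(), start=1):
        if number in bad_lines:
            continue  # skip a box the trainer could not match to a blob
        kept.append(line)
    # Keep a trailing newline only when something is left to write.
    return "\n".join(kept) + ("\n" if kept else "")


if __name__ == "__main__":
    # Made-up sample: three boxes on page 0, the second one rejected.
    sample = (
        "a 10 10 20 30 0\n"
        "b 22 10 32 30 0\n"
        "c 34 10 44 30 0\n"
    )
    print(drop_boxes(sample, {2}), end="")
```

To use it on a real file, read the .box file into a string, call drop_boxes with the line numbers to discard, and write the result back out. Filtering by line number rather than by symbol keeps other, correctly matched instances of the same character in the training set.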

