[tesseract-ocr] Difference trained data for Chinese
Good day!

Recently I have been using Tesseract (4.0 alpha) for Chinese OCR, and it works really well. Now I want to pick the best model to use, but I have found several versions. What is the difference between them?

1. chi_sim from https://github.com/tesseract-ocr/tesseract/wiki/Data-Files (around 50M)
2. chi_sim from https://github.com/tesseract-ocr/tessdata/tree/master/best (around 13M)
3. chi_sim_vert from https://github.com/tesseract-ocr/tessdata/tree/master/best (around 13M)
4. HanS from https://github.com/tesseract-ocr/tessdata/tree/master/best (around 16M)

All of them work, but the results are slightly different. In my own evaluation #4 is the best, but I don't have any insight into why.

I would appreciate any help.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/8cc88ed2-99c3-445e-b758-83ade0f680aa%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
[tesseract-ocr] Re: Difference trained data for Chinese
Please see https://github.com/tesseract-ocr/tessdata/issues/72

On Friday, August 11, 2017 at 2:26:55 PM UTC+5:30, Yang Yu wrote:
>
> Good day!
>
> Recently I was using tesseract (4.0 alpha) to do Chinese OCR and it works
> really great. Now I want to pick up a best model to use but I find several
> versions. What is the difference between them?
>
> 1. chi_sim from https://github.com/tesseract-ocr/tesseract/wiki/Data-Files
> (around 50M)
> 2. chi_sim from https://github.com/tesseract-ocr/tessdata/tree/master/best
> (around 13M)
> 3. chi_sim_vert from
> https://github.com/tesseract-ocr/tessdata/tree/master/best (around 13M)
> 4. HanS from https://github.com/tesseract-ocr/tessdata/tree/master/best
> (around 16M)
>
> All of them can work but the results are slightly different. From my own
> evaluation #4 is the best, but I don't have any insight.
>
> Appreciate for any help.
>
[tesseract-ocr] Re: How Much TO Enlarge Screenshot?
Tesseract works best on images at around 300 dpi, so you should scale the screenshot up to that level. The enlargement factor is 300 divided by the screen dpi. For example, if the screen dpi is 72, enlarge the screenshot to about 400% (300 / 72 is about 4.17).

On Friday, August 4, 2017 at 1:56:32 AM UTC+8, James Lee wrote:
>
> Is there a way to find out how much to enlarge a screenshot for best
> accuracy?
> Is there a math formula if I know the internal display resolution (not
> dpi?)
> Thanks!
>
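The arithmetic in this reply can be written down as a small helper. This is only a sketch: the names `kTargetDpi`, `ScaleFactor` and `ScaledSize` are made up for illustration, and the 300 dpi target is simply the figure quoted in the reply, not a value read from Tesseract itself.

```cpp
#include <cassert>
#include <cmath>

// Target resolution taken from the reply above (about 300 dpi).
// Illustrative constant, not something exported by Tesseract.
constexpr double kTargetDpi = 300.0;

// Returns the factor by which an image captured at `screen_dpi` must be
// enlarged to reach `target_dpi`.
double ScaleFactor(double screen_dpi, double target_dpi = kTargetDpi) {
    assert(screen_dpi > 0.0 && target_dpi > 0.0);
    return target_dpi / screen_dpi;
}

// Returns one image dimension (width or height, in pixels) after scaling,
// rounded to the nearest whole pixel.
int ScaledSize(int pixels, double screen_dpi, double target_dpi = kTargetDpi) {
    return static_cast<int>(
        std::lround(pixels * ScaleFactor(screen_dpi, target_dpi)));
}
```

For a 72 dpi screen `ScaleFactor(72.0)` is about 4.17, which matches the roughly 400% mentioned in the reply; for a 96 dpi screen it is 3.125, so a 1920 pixel wide capture would become 6000 pixels wide.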
[tesseract-ocr] How to know how many symbol is a word in pagelayout?
I can use the code below to draw the bounding box of every word and every symbol. Now I would like to know: once I have a word, is there a way to find out how many symbols it contains?

Thanks for any info!

=

#include <cstdio>
#include <iostream>
#include <leptonica/allheaders.h>
#include <tesseract/baseapi.h>

using namespace std;

int main() {
    std::cout << "Hello, World!" << std::endl;

    tesseract::TessBaseAPI api;
    api.InitForAnalysePage();
    api.SetPageSegMode(tesseract::PSM_SPARSE_TEXT);

    Pix *image = pixRead("/tmp/ytmp/en4.png");
    if (image == nullptr) {
        fprintf(stderr, "could not read input image\n");
        return 1;
    }

    l_int32 width = 0, height = 0, depth = 0;
    pixGetDimensions(image, &width, &height, &depth);
    printf("w=%d h=%d dep=%d\n", width, height, depth);

    api.SetImage(image);
    tesseract::PageIterator *iter = api.AnalyseLayout(true);
    if (iter == nullptr) {
        // Nothing was found on the page.
        pixDestroy(&image);
        return 1;
    }

    // Draw a thick box around every word. The iterator already points at
    // the first word, so use do/while; a plain while (iter->Next(...))
    // would skip it.
    int word_count = 0;
    do {
        int left, top, right, bottom;
        if (iter->BoundingBox(tesseract::RIL_WORD,
                              &left, &top, &right, &bottom)) {
            ++word_count;
            //===
            // I have the word bounding box here, but I want to know
            // how many symbols are in this word.
            //===
            pixRenderLine(image, left, top, left, bottom, 3, L_CLEAR_PIXELS);
            pixRenderLine(image, left, top, right, top, 3, L_CLEAR_PIXELS);
            pixRenderLine(image, left, bottom, right, bottom, 3, L_CLEAR_PIXELS);
            pixRenderLine(image, right, top, right, bottom, 3, L_CLEAR_PIXELS);
        }
    } while (iter->Next(tesseract::RIL_WORD));

    // Rewind and draw a thin box around every symbol.
    int symbol_count = 0;
    iter->Begin();
    do {
        int left, top, right, bottom;
        if (iter->BoundingBox(tesseract::RIL_SYMBOL,
                              &left, &top, &right, &bottom)) {
            ++symbol_count;
            pixRenderLine(image, left, top, left, bottom, 1, L_CLEAR_PIXELS);
            pixRenderLine(image, left, top, right, top, 1, L_CLEAR_PIXELS);
            pixRenderLine(image, left, bottom, right, bottom, 1, L_CLEAR_PIXELS);
            pixRenderLine(image, right, top, right, bottom, 1, L_CLEAR_PIXELS);
        }
    } while (iter->Next(tesseract::RIL_SYMBOL));
    printf("words=%d symbols=%d\n", word_count, symbol_count);

    pixWrite("/tmp/ytmp/entt.png", image, IFF_PNG);

    // Release the iterator and the image.
    delete iter;
    pixDestroy(&image);
    return 0;
}
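One possible approach to the question, offered as a sketch rather than a confirmed answer from the list: step through the page at `tesseract::RIL_SYMBOL` and start a new counter each time the iterator reports that it sits at the start of a word, which `PageIterator::IsAtBeginningOf(tesseract::RIL_WORD)` should indicate. To keep the example independent of Tesseract and Leptonica, the helper below takes those per-symbol flags as a plain vector. The name `CountSymbolsPerWord` and the flag vector are hypothetical and are not part of the Tesseract API.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of Tesseract.
// begins_word[i] is true when symbol i is the first symbol of a word,
// for example the value of iter->IsAtBeginningOf(tesseract::RIL_WORD)
// recorded for each symbol while stepping with
// iter->Next(tesseract::RIL_SYMBOL).
// Returns one symbol count per word, in iteration order.
std::vector<int> CountSymbolsPerWord(const std::vector<bool>& begins_word) {
    std::vector<int> counts;
    for (std::size_t i = 0; i < begins_word.size(); ++i) {
        // Open a new word at every word start. The very first symbol
        // always opens a word, even if its flag happens to be false.
        if (begins_word[i] || counts.empty()) {
            counts.push_back(0);
        }
        ++counts.back();
    }
    return counts;
}
```

With the iterator from the posted code, the flags could be collected after `iter->Begin()` in a `do { ... } while (iter->Next(tesseract::RIL_SYMBOL));` loop that pushes `iter->IsAtBeginningOf(tesseract::RIL_WORD)` for each symbol. The resulting vector then lines up with the words in the order the word loop visits them, assuming both loops walk the page in the same order.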