[tesseract-ocr] Difference trained data for Chinese

2017-08-11 Thread Yang Yu
Good day!

I have recently been using Tesseract (4.0 alpha) for Chinese OCR and it works 
really well. Now I want to pick the best model to use, but I found several 
versions. What is the difference between them?

1. chi_sim from https://github.com/tesseract-ocr/tesseract/wiki/Data-Files 
(around 50M)
2. chi_sim from https://github.com/tesseract-ocr/tessdata/tree/master/best 
(around 13M)
3. chi_sim_vert from 
https://github.com/tesseract-ocr/tessdata/tree/master/best (around 13M)
4. HanS from https://github.com/tesseract-ocr/tessdata/tree/master/best 
(around 16M)

All of them work, but the results are slightly different. In my own 
evaluation #4 is the best, but I don't have any insight into why.

I would appreciate any help.

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/8cc88ed2-99c3-445e-b758-83ade0f680aa%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


[tesseract-ocr] Re: Difference trained data for Chinese

2017-08-11 Thread shree
Please see https://github.com/tesseract-ocr/tessdata/issues/72 






[tesseract-ocr] Re: How Much TO Enlarge Screenshot?

2017-08-11 Thread Dbsk Dbsk
Tesseract works best at around 300 dpi, so you should scale the screenshot to 
that level. For example, if the screen DPI is 72, enlarge the screenshot to 
about 400% (300 / 72 is roughly 4.17).
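
As a rough sketch of that arithmetic: the function names below are 
hypothetical, and the 300 dpi default is only the rule of thumb mentioned 
above, not a hard requirement of Tesseract.

```cpp
#include <cassert>
#include <cmath>

// Scale factor that brings an image from source_dpi up to target_dpi.
// The 300 dpi default is an assumption taken from the advice above.
double ScaleForTargetDpi(double source_dpi, double target_dpi = 300.0) {
    assert(source_dpi > 0.0 && target_dpi > 0.0);
    return target_dpi / source_dpi;
}

// New pixel dimension after scaling, rounded to the nearest integer.
int ScaledSize(int pixels, double scale) {
    return static_cast<int>(std::lround(pixels * scale));
}
```

For a 72 dpi screen, ScaleForTargetDpi(72.0) is 300 / 72, so 400% is a 
reasonable round figure; the best factor for a given font size may still 
need some experimentation.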

On Friday, August 4, 2017 at 1:56:32 AM UTC+8, James Lee wrote:
>
> Is there a way to find out how much to enlarge a screenshot for best 
> accuracy?
> Is there a math formula if I know the internal display resolution (not 
> dpi?)
> Thanks!
>
>



[tesseract-ocr] How to know how many symbol is a word in pagelayout?

2017-08-11 Thread Dbsk Dbsk
I can use the code below to draw the bounding box of every word and every 
symbol. Now I want to know: once I have a word, is there a way to find out 
how many symbols it contains?

Thanks for any info!

=
#include <cstdio>
#include <iostream>
#include <leptonica/allheaders.h>

#include <tesseract/baseapi.h>

using namespace std;

// Draw a rectangle outline on the image with the given line width.
static void drawBox(Pix *image, int left, int top, int right, int bottom,
                    int line_width) {
    pixRenderLine(image, left, top, left, bottom, line_width, L_CLEAR_PIXELS);
    pixRenderLine(image, left, top, right, top, line_width, L_CLEAR_PIXELS);
    pixRenderLine(image, left, bottom, right, bottom, line_width, L_CLEAR_PIXELS);
    pixRenderLine(image, right, top, right, bottom, line_width, L_CLEAR_PIXELS);
}

int main() {
    std::cout << "Hello, World!" << std::endl;
    tesseract::TessBaseAPI api;
    api.InitForAnalysePage();
    api.SetPageSegMode(tesseract::PSM_SPARSE_TEXT);

    Pix *image = pixRead("/tmp/ytmp/en4.png");
    if (image == nullptr) {
        fprintf(stderr, "cannot read input image\n");
        return 1;
    }

    // process gray color to white
    l_uint32 pixel_color;
    l_int32 r, g, b;
    l_int32 width = 0, height = 0, depth = 0;

    pixGetDimensions(image, &width, &height, &depth);
    printf("w=%d h=%d dep=%d\n", width, height, depth);

    api.SetImage(image);
    tesseract::PageIterator *iter = api.AnalyseLayout(true);
    if (iter == nullptr) {
        fprintf(stderr, "layout analysis failed\n");
        pixDestroy(&image);
        return 1;
    }

    // Words. do/while is used so that the first word is not skipped.
    int word_count = 0;
    do {
        int left, top, right, bottom;
        if (!iter->BoundingBox(tesseract::RIL_WORD,
                               &left, &top, &right, &bottom)) {
            continue;
        }
        ++word_count;
        // I have the word bounding box here, but I want to know
        // how many symbols are in this word.
        drawBox(image, left, top, right, bottom, 3);
    } while (iter->Next(tesseract::RIL_WORD));

    // Symbols. Counted separately from the words.
    int symbol_count = 0;
    iter->Begin();
    do {
        int left, top, right, bottom;
        if (!iter->BoundingBox(tesseract::RIL_SYMBOL,
                               &left, &top, &right, &bottom)) {
            continue;
        }
        ++symbol_count;
        drawBox(image, left, top, right, bottom, 1);
    } while (iter->Next(tesseract::RIL_SYMBOL));

    printf("words=%d symbols=%d\n", word_count, symbol_count);
    pixWrite("/tmp/ytmp/entt.png", image, IFF_PNG);

    delete iter;
    api.End();
    pixDestroy(&image);
    return 0;
}
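
One possible approach, offered as a sketch rather than a confirmed answer 
from the list: walk the page at symbol level and start a new counter whenever 
the iterator reports that it is at the beginning of a word. The template 
below assumes only the Begin(), Next() and IsAtBeginningOf() members that 
tesseract::PageIterator provides. FakeIterator and Level are hypothetical 
stand-ins so the counting logic can be tried without Tesseract installed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns one entry per word: the number of symbols seen in that word.
// Iterator is expected to look like tesseract::PageIterator.
template <typename Iterator, typename Level>
std::vector<int> CountSymbolsPerWord(Iterator &it, Level symbol_level,
                                     Level word_level) {
    std::vector<int> counts;
    it.Begin();
    do {
        // The first symbol of each word opens a new counter.
        if (it.IsAtBeginningOf(word_level)) counts.push_back(0);
        // On an empty page no counter exists, so nothing is counted.
        if (!counts.empty()) ++counts.back();
    } while (it.Next(symbol_level));
    return counts;
}

// Hypothetical stand-ins for tesseract::PageIteratorLevel and PageIterator.
enum Level { kWord, kSymbol };

struct FakeIterator {
    std::vector<bool> starts_word;  // one flag per symbol on the "page"
    std::size_t pos = 0;

    void Begin() {
        // The first symbol on a page always starts a word.
        assert(starts_word.empty() || starts_word.front());
        pos = 0;
    }
    bool IsAtBeginningOf(Level level) const {
        if (pos >= starts_word.size()) return false;
        return level == kSymbol || starts_word[pos];
    }
    bool Next(Level) {
        if (pos + 1 >= starts_word.size()) return false;
        ++pos;
        return true;
    }
};
```

With the program above, the call would presumably be 
CountSymbolsPerWord(*iter, tesseract::RIL_SYMBOL, tesseract::RIL_WORD). 
As far as I know, after layout analysis alone (no recognition) the "symbols" 
are connected-component blobs, so the counts may differ from the number of 
recognized characters.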
