I want to extract numbers from an image. The numbers are usually placed around 
a figure and sometimes inside it. I'm using Tesseract for this task. Tesseract 
works quite well for documents with a lot of text, but I have not found 
parameters that give good results here. I tried different page segmentation 
modes (PSM_SPARSE_TEXT should in theory work best here), all engine modes, a 
character whitelist, disabling table detection, disabling the dictionary, and 
so on.

Usually the images look like the attached 'NumbersWithFigure'.

[image: NumbersWithFigure.jpg]

Using a 'cleaned' image like the attached 'OnlyNumbers' did not give 
noticeably better results either.

[image: OnlyNumbers.jpg]

I'm using Tess4j 
<https://mvnrepository.com/artifact/net.sourceforge.tess4j/tess4j/4.5.3> to 
access Tesseract with Java like this:

Tesseract1 tesseract = new Tesseract1(); // default lang is eng, default OEM is TessOcrEngineMode.OEM_DEFAULT
tesseract.setTessVariable("textord_tabfind_find_tables", "0"); // table detection disabled
tesseract.setTessVariable("tessedit_enable_doc_dict", "0"); // don't use dictionary
tesseract.setTessVariable("tessedit_char_whitelist", "0123456789"); // only numbers
tesseract.setTessVariable("load_system_dawg", "0"); // system dictionary will not be loaded
tesseract.setPageSegMode(TessPageSegMode.PSM_SPARSE_TEXT);
tesseract.setDatapath(new File("./tessdata/").getAbsolutePath());
System.out.println("Words: " + tesseract.getWords(entry.getValue(), TessPageIteratorLevel.RIL_WORD));

Any ideas (parameters and/or links to specialized training data)?

I've also posted this question on StackOverflow (here 
<https://stackoverflow.com/questions/64354275/>), but maybe I'll have more 
luck here :-)

