[tesseract-ocr] How to recognize some specific symbols with Tess4.0

2017-07-31 Thread robertyoung0511









Hello,

I'm trying to use Tess4.0 to recognize simplified Chinese with the following 
command-line arguments:
  argc = 13;
  argv[1] = "E:/数据库/yanghui_results/yanghui_100_0.jpg";
  argv[2] = "E:/sample/01";
  argv[3] = "-l";
  argv[4] = "chi_sim+eng";
  argv[5] = "-psm";
  argv[6] = "7";
  argv[7] = "--oem";
  argv[8] = "OEM_TESSERACT_LSTM_COMBINED";
  argv[9] = "--tessdata-dir";
  argv[10] = "../tessdata";
  argv[11] = "--user-words";
  argv[12] = "../tessdata/chi_sim.user-words";

I have used the chi_sim and eng traineddata as the tessdata language, but 
some specific symbols, such as '∠' (means an angle), cannot be correctly 
recognized.


For example, the image above is the input to Tess4.0, and the result is as 
follows:
如图, 在口ABCD中, 点E, F在AC上, 且乙ABE=乙CDF, 求证: BE=DF,

From the results, we can observe that the '∠' symbol has been recognized as 
'乙', the rhomboid symbol as '口', and the period '.' as a comma ','.

How can I correctly recognize these specific symbols with Tess4.0? Can you 
help me?
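
The substitution reported above is regular enough to be undone after 
recognition. The following is only a post-processing sketch, not a fix for the 
recognizer itself: it assumes that '乙' directly followed by two or three 
uppercase Latin letters is always a misread '∠'. The function name and the 
rule are assumptions made for illustration; they are not part of Tesseract.

```python
import re

# Assumption: in geometry text, 乙 immediately followed by 2-3 uppercase
# Latin letters (e.g. "乙ABE") is a misrecognized angle sign.
ANGLE_RE = re.compile(r"乙(?=[A-Z]{2,3})")


def fix_angle_symbol(text):
    """Replace 乙 with ∠ only where it precedes an uppercase vertex label."""
    return ANGLE_RE.sub("∠", text)


if __name__ == "__main__":
    ocr_line = "如图, 在口ABCD中, 点E, F在AC上, 且乙ABE=乙CDF, 求证: BE=DF,"
    print(fix_angle_symbol(ocr_line))
```

The lookahead keeps ordinary uses of '乙' in running Chinese text untouched. 
The '口' and comma substitutions are not handled, because they cannot be told 
apart from legitimate characters by context this simple.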

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/8c00f7b8-1d84-4824-96a4-c8c2e50781bc%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: [tesseract-ocr] Building tesseract 4.0.0 from master on OS X

2017-07-31 Thread Kevin Schiesser
I used brew to install the dependencies and then ran the following:

$ ./autogen.sh
$ make
$ sudo make install
$ make training

The last command exits with the following:

ld: library not found for -lgobject-2.0
collect2: error: ld returned 1 exit status
make[1]: *** [text2image] Error 1
make: *** [training] Error 2

On Monday, July 31, 2017 at 12:42:32 PM UTC-7, Stefan Weil wrote:
>
> Kevin, how did you run the failing builds on macOS?
>
> I just tested building with `brew install tesseract --HEAD 
> --with-training-tools` and had no problems.
> An automake-based build also works with MacPorts.
> No modifications were needed for Tesseract git master.
>



Re: [tesseract-ocr] Building tesseract 4.0.0 from master on OS X

2017-07-31 Thread 'Stefan Weil' via tesseract-ocr
Kevin, how did you run the failing builds on macOS?

I just tested building with `brew install tesseract --HEAD 
--with-training-tools` and had no problems.
An automake-based build also works with MacPorts.
No modifications were needed for Tesseract git master.



[tesseract-ocr] tesseract-ocr-ell, tesseract-ocr-grc: improvements

2017-07-31 Thread dimitrDimitr
At http://www.elspell.gr/myspell there is the OpenOffice Greek Dictionary v0.9 
with 800,000 Greek words, encoded in windows-1253, under the 
MPL 1.1/GPL 2.0/LGPL 2.1 license.

Polytonic characters have not been used since 1982, and we don't have 
wordlists for them. 

Only sources like the Bible contain polytonic words, but they don't belong to 
modern Greek. 

The maintainer of tesseract-ocr-grc uses a wordlist based on ancient Greek 
polytonic texts.

The Greek polytonic Unicode characters U+1F00 to U+1FFC aren't useful in 
the tesseract-ocr-ell package, and they may confuse OCR recognition.

Conversely, tesseract-ocr-grc must have the polytonic characters 
and not the monotonic Greek characters U+0386 to U+03CE.
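
The split described above can be sketched as a simple code point check. This 
is a minimal illustration, assuming Python 3 and words stored as precomposed 
code points; the function names are hypothetical and are not part of either 
package. The two ranges are the ones quoted in this message.

```python
# Code point ranges quoted in the message above.
POLYTONIC = (0x1F00, 0x1FFC)  # polytonic Greek characters
MONOTONIC = (0x0386, 0x03CE)  # monotonic Greek characters


def in_range(ch, bounds):
    """True if the single character ch lies inside the inclusive range."""
    low, high = bounds
    return low <= ord(ch) <= high


def is_polytonic(ch):
    return in_range(ch, POLYTONIC)


def split_wordlist(words):
    """Return (monotonic_words, polytonic_words).

    A word counts as polytonic as soon as it contains one character
    from the polytonic range; everything else goes to the first list.
    """
    mono, poly = [], []
    for word in words:
        if any(is_polytonic(c) for c in word):
            poly.append(word)
        else:
            mono.append(word)
    return mono, poly


if __name__ == "__main__":
    # Escapes are used so the exact code points are unambiguous:
    # the first word uses U+03CC (monotonic), the second U+1F04 (polytonic).
    sample = ["\u03bb\u03cc\u03b3\u03bf\u03c2",
              "\u1f04\u03bd\u03b8\u03c1\u03c9\u03c0\u03bf\u03c2"]
    mono, poly = split_wordlist(sample)
    print(len(mono), len(poly))
```

Words written with combining diacritics instead of precomposed characters 
would need to be normalized first; that step is left out here.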





[tesseract-ocr] Re: Tesseract AdaptToWordStr usage?

2017-07-31 Thread Syed Uzair
Sorry, I made a mistake in the attached file names. 
When I uncomment lines 18,19, my console reads like output3.png (attachment).
output2.png is the debug file.

Thanks

On Monday, July 31, 2017 at 5:09:37 PM UTC+5:30, Syed Uzair wrote:
>
> Hello all
>
> I am trying to extract text from the attached image (010003.bin.png) using 
> tesserocr (Python wrapper for the Tesseract 3.04 API). When I ran the script 
> TestAdapttoWord.py (attachment) with lines 18,19 commented out, my console 
> read like output1.png (attachment), and when I uncommented lines 18,19, my 
> console read like output2.png (attachment).
> According to the AdaptToWordStr documentation, it returns true if it was 
> able to adapt to the given word. I am getting true, but when I then call 
> GetUTF8Text I get empty results. I was hoping it would give the correct 
> result after AdaptToWordStr returns true.
>
> I am not sure whether I am using AdaptToWordStr correctly, because 
> the documentation doesn't say much. Is my interpretation of AdaptToWordStr 
> correct?  
> I am on Ubuntu 16 using Tesseract 3.04.
>
> Thanks
>
>
>



Re: [tesseract-ocr] ERROR: Could not find training text file

2017-07-31 Thread ShreeDevi Kumar
Add a line similar to the following to your training command, pointing to where
your training text is located:

  --training_text ../langdata/eng/eng.training_text \
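
The error message in the quoted post suggests that, with --langdata_dir 
langdata and --lang eng, the script looks for langdata/eng/eng.training_text 
unless --training_text points elsewhere. A small sketch for checking this 
before starting a training run, assuming Python 3; the helper names are 
hypothetical and are not part of tesstrain.sh:

```python
import os


def expected_training_text(langdata_dir, lang):
    """Path implied by the error message: <langdata_dir>/<lang>/<lang>.training_text."""
    return os.path.join(langdata_dir, lang, lang + ".training_text")


def resolve_training_text(langdata_dir, lang, override=None):
    """Return the training text path to use, or None if the file is missing.

    override mirrors an explicit --training_text argument; when it is not
    given, the path implied by the error message is checked instead.
    """
    path = override if override else expected_training_text(langdata_dir, lang)
    return path if os.path.isfile(path) else None


if __name__ == "__main__":
    found = resolve_training_text("langdata", "eng")
    if found is None:
        print("missing:", expected_training_text("langdata", "eng"))
    else:
        print("using:", found)
```

If the check reports the file as missing, either place the text at that path 
or pass its real location with --training_text as shown above.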


ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Mon, Jul 31, 2017 at 4:24 PM, Ava Nimaee wrote:

> Hi. Sorry, I used this syntax:
> training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
>   --noextract_font_properties --langdata_dir langdata \
>   --tessdata_dir tessdata \
>   --fontlist "Times New Roman," --output_dir engtrain
> Before that, I created the box file, the tif, and Ucnicahset_output.
> I cloned langdata for Tesseract v4.0,
> but I get this error:
>  === Phase I: Generating training images ===
> ERROR: Could not find training text file langdata/eng/eng.training_text
> I can't solve it, and I don't know where I should put training_text.txt;
> it is a text file that I want to train on.
> Thanks for your attention.
>



[tesseract-ocr] Tesseract AdaptToWordStr usage?

2017-07-31 Thread Syed Uzair
Hello all

I am trying to extract text from the attached image (010003.bin.png) using 
tesserocr (Python wrapper for the Tesseract 3.04 API). When I ran the script 
TestAdapttoWord.py (attachment) with lines 18,19 commented out, my console 
read like output1.png (attachment), and when I uncommented lines 18,19, my 
console read like output2.png (attachment).
According to the AdaptToWordStr documentation, it returns true if it was 
able to adapt to the given word. I am getting true, but when I then call 
GetUTF8Text I get empty results. I was hoping it would give the correct 
result after AdaptToWordStr returns true.

I am not sure whether I am using AdaptToWordStr correctly, because 
the documentation doesn't say much. Is my interpretation of AdaptToWordStr 
correct?  
I am on Ubuntu 16 using Tesseract 3.04.

Thanks


# Attached script, referred to in the message above as TestAdapttoWord.py.
from PIL import Image
from tesserocr import PyTessBaseAPI, RIL, PSM
import tesserocr

# print(tesserocr.tesseract_version())  # print tesseract-ocr version

image = Image.open('010003.bin.png')
# Expected text of each word image, in detection order, written as
# space-separated characters.
words = ['( b )', 'S a l e s', 'o f', 'T r a d e d', 'G o o d s']

with PyTessBaseAPI() as api:
    api.SetImage(image)
    api.SetDebugVariable("debug_file", "debug.txt")
    boxes = api.GetComponentImages(RIL.WORD, True)
    print('Found {} word image components.'.format(len(boxes)))
    for i, (im, box, _, _) in enumerate(boxes):
        # im.show()
        api.SetPageSegMode(8)  # 8 = treat the region as a single word
        api.SetRectangle(box['x'], box['y'], box['w'], box['h'])
        b = api.AdaptToWordStr(psm=8, word=words[i])  # line 18 of the attachment
        print(b)                                      # line 19 of the attachment
        ocrResult = api.GetUTF8Text()
        print("Word" + str(i) + " Text:" + ocrResult)
        conf = api.MeanTextConf()



[tesseract-ocr] ERROR: Could not find training text file

2017-07-31 Thread Ava Nimaee
Hi. Sorry, I used this syntax:
training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
  --noextract_font_properties --langdata_dir langdata \
  --tessdata_dir tessdata \
  --fontlist "Times New Roman," --output_dir engtrain
Before that, I created the box file, the tif, and Ucnicahset_output.
I cloned langdata for Tesseract v4.0,
but I get this error:
 === Phase I: Generating training images ===
ERROR: Could not find training text file langdata/eng/eng.training_text
I can't solve it, and I don't know where I should put training_text.txt; 
it is a text file that I want to train on.
Thanks for your attention.
