[tesseract-ocr] Re: use jTesseractEdit training but box edit is empty

2017-06-07 Thread Shaw Ryan
Yes, the box file is empty.
I will try preprocessing the image first.
I appreciate what you have done for me. Thank you.
On Thursday, June 8, 2017 at 2:31:41 AM UTC+8, Quan Nguyen wrote:
>
> I don't see any box file, but from the appearance of the image, Tesseract 
> probably had problems recognizing it, therefore, producing an empty box 
> file. You'll need to perform some image processing first to make the image 
> more amenable to Tesseract.
>
> On Tuesday, June 6, 2017 at 9:44:58 PM UTC-5, Shaw Ryan wrote:
>>
>> Thank you 
>> I have uploaded box and tiff
>> Please help
>> On Monday, June 5, 2017 at 6:27:14 PM UTC+8, Shaw Ryan wrote:
>>>
>>>
>>> 
>>> How can I edit the data?
>>>
>>
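The preprocessing Quan suggests does not have to be elaborate. Below is a minimal sketch of one common recipe (grayscale, upscale, global threshold); it assumes Pillow and NumPy are installed, and the threshold value of 128 is illustrative, not tuned:

```python
import numpy as np
from PIL import Image

def preprocess(img: Image.Image, scale: int = 2, thresh: int = 128) -> Image.Image:
    """Grayscale, upscale, and binarize an image before handing it to Tesseract."""
    img = img.convert("L")                                    # grayscale
    img = img.resize((img.width * scale, img.height * scale),
                     Image.LANCZOS)                           # upscale small text
    arr = np.asarray(img)
    binary = np.where(arr > thresh, 255, 0).astype(np.uint8)  # global threshold
    return Image.fromarray(binary)

# Quick demo on a synthetic light-gray tile.
demo = Image.fromarray(np.full((10, 10), 200, dtype=np.uint8))
out = preprocess(demo)
print(out.size)  # doubled in both dimensions
```

A cleaner, larger, higher-contrast image like this typically yields a non-empty box file; adaptive thresholding or deskewing may help further on difficult scans.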

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/8434a139-94a9-4949-ad9f-5f897ea18f65%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


[tesseract-ocr] Re: What is the "Confidence" value returned by Tesseract and how is it calculated?

2017-06-07 Thread akhil katpally
1 -> I don't know; I am looking for that as well. Hopefully this helps: whenever 
Tesseract tries to recognize a particular character, it considers several 
candidate choices for that letter, takes the one with the maximum confidence 
value, and returns it to us. You can also get the alternative choices and 
their confidences with the tesseract::ChoiceIterator class.
2 -> What do you mean by changing the accuracy levels of Tesseract?
3 -> Yes, you can get the confidence at the character level; please see the 
Tesseract API examples: 
https://github.com/tesseract-ocr/tesseract/wiki/APIExample#example-of-iterator-over-the-classifier-choices-for-a-single-symbol
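For reference, the wiki example linked above boils down to iterating at symbol level and opening a ChoiceIterator on each symbol. A condensed C++ sketch (the input file name is hypothetical; error handling omitted):

```cpp
#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>
#include <cstdio>

int main() {
    tesseract::TessBaseAPI api;
    api.Init(nullptr, "eng");
    Pix *image = pixRead("word.png");           // hypothetical input image
    api.SetImage(image);
    api.SetVariable("save_blob_choices", "T");  // keep per-symbol alternatives
    api.Recognize(nullptr);

    tesseract::ResultIterator *ri = api.GetIterator();
    if (ri != nullptr) {
        do {
            const char *symbol = ri->GetUTF8Text(tesseract::RIL_SYMBOL);
            if (symbol != nullptr) {
                printf("symbol %s, conf: %.2f\n",
                       symbol, ri->Confidence(tesseract::RIL_SYMBOL));
                // Alternative classifier choices for this symbol:
                tesseract::ChoiceIterator ci(*ri);
                do {
                    printf("  choice %s, conf: %.2f\n",
                           ci.GetUTF8Text(), ci.Confidence());
                } while (ci.Next());
            }
            delete[] symbol;
        } while (ri->Next(tesseract::RIL_SYMBOL));
    }
    pixDestroy(&image);
    api.End();
    return 0;
}
```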
On Thursday, June 1, 2017 at 4:09:12 AM UTC-7, Thilina Jayathilaka wrote:
>
> Hello, 
>
> 1. I need to know what the confidence value (returned by the Tesseract API) 
> is and how that value is calculated. 
>
> 2. Is there any possibility that I can change the accuracy levels of 
> tesseract? 
>
> 3. Can I detect the confidence value for *each letter* separately when I 
> pass an image which contains a *word*?
>



[tesseract-ocr] Re: how to use tesseract to detect table?

2017-06-07 Thread akhil katpally
You can use Tesseract parameters: internally, Tesseract detects the tables, 
and you can leverage that information and print it out. Some of the 
parameters will also print out the detected table information (coordinates): 
textord_dump_table_images --- show table regions (this dumps intermediate 
debug images) 
textord_tablefind_show_stats --- show page stats used in table finding 
There are more you can try. To set these parameters on the command line, use 
the -c option followed by parameter=value.

On Monday, April 17, 2017 at 1:03:03 AM UTC-7, Azka Gilani wrote:
>
> @johnny did you find anything on that? I am stuck on the same problem.
> @dinh van Chinh that method doesn't use the Tesseract API!
>
> On Monday, July 18, 2016 at 6:05:54 PM UTC-4, Johnny ho wrote:
>>
>> Are there any examples showing how to use Tesseract to detect tables in 
>> an image?
>>
>
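Putting the advice above together, a command line using -c to enable those table-debugging parameters might look like this (file names are placeholders):

```shell
# OCR page.png into page.txt while dumping table-detection debug info.
# Each -c sets one Tesseract configuration variable for this run only.
tesseract page.png page \
  -c textord_tablefind_show_stats=1 \
  -c textord_dump_table_images=1
```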





[tesseract-ocr] Place of feature extraction in optical character recognition

2017-06-07 Thread PashaTurkish
Hi all

My question is closer to OCR theory than to tesseract-ocr, but I am posting 
it here because it is still related to OCR and OCR software.

I am learning OCR and reading this book: 
https://www.amazon.com/Character-Recognition-Different-Languages-Computing/dp/3319502514

The authors define eight OCR processes, each following the previous one (2 
after 1, 3 after 2, etc.):

   1. Optical scanning
   2. Location segmentation
   3. Pre-processing
   4. Segmentation
   5. Representation
   6. Feature extraction
   7. Recognition
   8. Post-processing

This is what they write about representation (#5)

The fifth OCR component is representation. The image representation plays 
one of the most important roles in any recognition system. In the simplest 
case, gray level or binary images are fed to a recognizer. However, in most 
of the recognition systems in order to avoid extra complexity and to 
increase the accuracy of the algorithms, a more compact and characteristic 
representation is required. For this purpose, a set of features is 
extracted for each class that helps distinguish it from other classes while 
remaining invariant to characteristic differences within the class. The 
character image representation methods are generally categorized into three 
major groups: (a) global transformation and series expansion (b) 
statistical representation and (c) geometrical and topological 
representation.

This is what they write about feature extraction (#6)

The sixth OCR component is feature extraction. The objective of feature 
extraction is to capture essential characteristics of symbols. Feature 
extraction is accepted as one of the most difficult problems of pattern 
recognition. The most straightforward way of describing a character is by 
its actual raster image. Another approach is to extract certain features 
that characterize symbols but leave out the unimportant attributes. The 
techniques for extracting such features are divided into three groups, viz. 
(a) distribution of points, (b) transformations and series expansions, and 
(c) structural analysis.

Please explain why feature extraction comes after representation rather than 
before it. As I understand it, during representation we derive from the 
image (!) a certain model of the character, so after that we should match 
this model to a certain class. I do not understand what we do during feature 
extraction, or perhaps I am misunderstanding everything. Please help.
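A toy sketch may make the distinction concrete (this is my own illustration, not from the book): the representation is what you hand to the recognizer, here a raw binary raster, while feature extraction reduces that representation to a short numeric vector, here via the "distribution of points" (zoning) approach:

```python
import numpy as np

# Representation: a raw 5x5 binary raster of a toy character.
char = np.array([
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0],
])

def zoning_features(img, zones=(2, 2)):
    """Feature extraction by zoning: fraction of ink pixels per zone."""
    h, w = img.shape
    zh, zw = zones
    feats = []
    for i in range(zh):
        for j in range(zw):
            block = img[i * h // zh:(i + 1) * h // zh,
                        j * w // zw:(j + 1) * w // zw]
            feats.append(float(block.mean()))  # ink density of this zone
    return feats

# The 25-pixel raster collapses to 4 numbers a classifier can compare.
print(zoning_features(char))
```

The recognizer then matches this short feature vector, rather than the raw raster, against each class; that matching is the recognition step that follows feature extraction.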


The question was also asked on SO 
https://stackoverflow.com/questions/44396721/place-of-feature-extraction-in-optical-character-recognition
 



Best regards, Pavel



[tesseract-ocr] Improve image so I can OCR it better

2017-06-07 Thread eliav schmulewitz


Hi

I posted this on stackoverflow but got no response...


I am trying to read subtitles from a news image using Tesseract in Python. 
For some reason, I get better results when I save the image using plt and 
have Tesseract read it from the saved file:

   1. Why is that?
   2. How can I refine my results using cv2?

import urllib3
import requests
import numpy as np
import pytesseract
import matplotlib.pyplot as plt
from PIL import Image

def downloadFile():
    url = 'https://drive.google.com/uc?export=download&id=0B7t_yZLolnbiaVpicnEwbDRjTmc'
    http = urllib3.PoolManager()
    r = http.request('GET', url)
    f = open('testing.npy', 'wb')
    f.write(r.data)
    f.close()

downloadFile()
frame = np.load('testing.npy')
new_frame = frame[170:210, 8:195]
plt.imshow(new_frame)
plt.axis('off')
plt.savefig('plt.png')
print('from array: ' + pytesseract.image_to_string(Image.fromarray(new_frame), lang='eng'))
print('from plt: ' + pytesseract.image_to_string(Image.open('plt.png'), lang='eng'))

Thank you!



[tesseract-ocr] Reading Japanese Text (Kanji)

2017-06-07 Thread akshat garg
Hi, 

As a part of exploring Tesseract, I have been trying to read Japanese text 
with it. 

I was able to read Katakana and Hiragana charts with a certain degree of 
accuracy (around 60-70%), but the Kanji symbols remain a mystery. 

Any suggestions are welcome. 
