[tesseract-ocr] Recognizing censorship blocks

Patrick Durusau Wed, 17 Dec 2014 14:28:26 -0800

Greetings!

I recently had wonderful success with tesseract-ocr on grand jury 
transcripts but now have a harder problem.


Can tesseract be trained to recognize censoring blocks in text? For example:

Assume this sentence has XXXXXXXXXXXXX a censoring block that obscures all 
the text it covers. (The block is represented here by the X's; in the actual 
text it is a solid black bar.)

What I want to do, in addition to recognizing the surrounding text, is to 
train tesseract to substitute "(redaction - N)" for the black mark, where N 
is the length of the redaction.

There aren't that many different sizes of redaction: probably from about one 
character space, or a little more, up to an entire line. Producing examples 
of all the blackouts would therefore be tedious but not difficult.
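
To make the goal concrete, here is a rough sketch in Python of the 
substitution I have in mind, done as a separate pass outside tesseract rather 
than through training. Everything in it is an assumption for illustration: 
the names find_redactions and redaction_token are made up, the text line is 
taken to be already binarized into rows of 0/1 pixels (1 = black), and N is 
estimated by dividing the bar width by an assumed average character width.

```python
def find_redactions(line_pixels, min_width, min_fill=0.9):
    """Return (start, end) column ranges that look like solid black bars.

    line_pixels: rows of 0/1 values for one text line (1 = black).
    min_width:   narrowest run of solid columns that counts as a bar.
    min_fill:    fraction of a column that must be black to be "solid".
    """
    height = len(line_pixels)
    width = len(line_pixels[0]) if height else 0
    runs, start = [], None
    # Scan one column past the end so a bar touching the right edge is closed.
    for x in range(width + 1):
        solid = x < width and sum(row[x] for row in line_pixels) >= min_fill * height
        if solid and start is None:
            start = x
        elif not solid and start is not None:
            if x - start >= min_width:
                runs.append((start, x))
            start = None
    return runs


def redaction_token(width_px, char_width_px):
    """Format the replacement text, with N in (assumed) character widths."""
    n = max(1, round(width_px / char_width_px))
    return f"(redaction - {n})"


if __name__ == "__main__":
    # Synthetic line: 10 px tall, 100 px wide, one 40 px bar and one thin stroke.
    line = [[0] * 100 for _ in range(10)]
    for row in line:
        for x in range(20, 60):
            row[x] = 1
        for x in range(5, 8):
            row[x] = 1
    for start, end in find_redactions(line, min_width=8):
        print(start, end, redaction_token(end - start, char_width_px=8))
```

The min_width threshold is what keeps tall letter strokes such as "l" or "I" 
from being mistaken for a bar. The surrounding text would still have to go 
through tesseract, with the tokens spliced in by position afterwards.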

Is that pushing tesseract in a direction it is not meant to go? 

If so, any suggestions on software that might be better suited to the task?

Thanks!

Patrick

