I think you need to deskew/dewarp the lines, increase the brightness, get the
images at 300 dpi, and try again.
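
As a rough illustration of the rescaling and brightening steps (deskewing is
not shown), here is a minimal pure-Python sketch. It is not what I ran on your
images. The function names, the list-of-rows grayscale representation, and the
gain value are all assumptions made for the example; in practice you would
probably use an imaging library for this.

```python
TARGET_DPI = 300  # resolution suggested above


def scale_factor(current_dpi, target_dpi=TARGET_DPI):
    """Factor by which to resample width and height to reach target_dpi."""
    if current_dpi <= 0:
        raise ValueError("current_dpi must be positive")
    return target_dpi / current_dpi


def brighten(pixels, gain=1.2, offset=0):
    """Return a brightened copy of an 8-bit grayscale image (list of rows).

    gain and offset are hypothetical tuning knobs; values are clamped to 0..255.
    """
    return [
        [max(0, min(255, round(p * gain + offset))) for p in row]
        for row in pixels
    ]


def resize_nearest(pixels, factor):
    """Nearest-neighbour resample of a list-of-rows image by the given factor."""
    height, width = len(pixels), len(pixels[0])
    new_h = max(1, round(height * factor))
    new_w = max(1, round(width * factor))
    return [
        [
            pixels[min(height - 1, int(y / factor))][min(width - 1, int(x / factor))]
            for x in range(new_w)
        ]
        for y in range(new_h)
    ]


if __name__ == "__main__":
    image = [[100, 250], [30, 60]]      # tiny stand-in for a 150 dpi scan
    factor = scale_factor(150)          # upscale towards 300 dpi
    prepared = resize_nearest(brighten(image), factor)
    print(factor, len(prepared), len(prepared[0]))
```

Nearest-neighbour resampling is used only to keep the sketch short; a smoother
interpolation generally suits text better.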

I tested your images with VietOCR (4.0 beta) and got the following
output:

----------------------

East 133rd Street, cast from Cypress Ave. In the background is
the United Electric Light and Power Co. plant on the East River Shore.

April 12, 1931.
P. L. Sperr.
NO REPRODUCTIONS.

------------------
901 Harrie Ave., west aide, between East lGlet and East 162nd
Streets.

About 1925 .

W. B. Vernem.
MAY BE REPRODUCED.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Fri, Jan 2, 2015 at 3:03 AM, Dan Vanderkam <[email protected]> wrote:

> I'm not specifying psm explicitly, so it must be 3 = Fully automatic page
> segmentation, but no OSD. (Default)
>
> On Tuesday, December 30, 2014 11:10:05 PM UTC-5, shree wrote:
>>
>> what page segmentation mode are you using?
>>
>> https://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html
>>
>> ShreeDevi
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Wed, Dec 31, 2014 at 6:18 AM, Dan Vanderkam <[email protected]> wrote:
>>
>>> More context here
>>> <http://stackoverflow.com/questions/27592430/how-can-i-tell-tesseract-that-my-font-has-a-particular-size>.
>>> I'm trying to get Tesseract to split some of its detected boxes in half or
>>> thirds.
>>>
>>> My approach has been to draw white vertical lines through the joined
>>> letters, so from before:
>>>
>>>
>>> to after:
>>>
>>>
>>> (http://i.imgur.com/TPcCsi0.png)
>>> If you can't see the lines, here they are in red:
>>>
>>> (http://i.imgur.com/MjSa0FS.png)
>>>
>>> I would have expected that drawing the white lines would split these
>>> boxes apart. It does that, but it also has a side effect: it joins the "9"
>>> on the first line with the "s" below it on the next line:
>>>
>>> Even if I draw a white line below the "9" and the "0", this still
>>> happens. As you might expect, these tall letters wreak havoc on the
>>> resulting OCR'd text.
>>>
>>> I'm baffled why this is happening. Based on this SO answer
>>> <http://stackoverflow.com/a/27605797/388951>, my understanding was that
>>> Tesseract looked at connected components to find boxes, so I would have
>>> expected the white lines to force apart two components.
>>>
>>> Is it possible to give Tesseract an explicit list of boxes? If not, is
>>> there a more effective way to force apart two letters than what I'm doing?
>>>
>>> Thanks!
>>>   - Dan
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To post to this group, send email to [email protected].
>>> Visit this group at http://groups.google.com/group/tesseract-ocr.
>>> To view this discussion on the web visit https://groups.google.com/d/
>>> msgid/tesseract-ocr/CAGiBXrzXUU9tC6MaKz89pugooXq31
>>> iDLQP1E3qr7d3s1CVgoxQ%40mail.gmail.com
>>> <https://groups.google.com/d/msgid/tesseract-ocr/CAGiBXrzXUU9tC6MaKz89pugooXq31iDLQP1E3qr7d3s1CVgoxQ%40mail.gmail.com?utm_medium=email&utm_source=footer>
>>> .
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
