Okay, I see. Very interesting articles, thank you. Since I don't know any 
other method for line segmentation, I used the hOCR output from Tesseract 
and then hocr-tools (I dug that out of some older GitHub issues), and 
that's how I generated line images for ground truth. Then I manually 
checked about 500-800 files and trained with them. There were lots of 
"misses" in the line segmentation, with 2 to 4 lines being "cut" as a 
single line image, so I corrected all of them. I also included those big 
"drop-caps" as line images, but not many of them.
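
As an illustration of the hOCR step described above, here is a minimal 
sketch (not the exact hocr-tools workflow) that reads the `ocr_line` 
bounding boxes from an hOCR file and flags boxes that are much taller than 
the typical line, which is one way to spot 2 to 4 lines merged into one 
image. The helper names and the 1.6 height factor are assumptions made for 
this example, not part of Tesseract or hocr-tools.

```python
import re
from html.parser import HTMLParser
from statistics import median

# hOCR stores geometry in the title attribute, e.g. title="bbox 36 92 618 184; ..."
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class LineBoxParser(HTMLParser):
    """Collects (x0, y0, x1, y1) for every element with class ocr_line."""

    def __init__(self):
        super().__init__()
        self.boxes = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "ocr_line" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self.boxes.append(tuple(int(v) for v in match.groups()))


def line_boxes(hocr_text):
    """Return the line bounding boxes found in an hOCR document."""
    parser = LineBoxParser()
    parser.feed(hocr_text)
    return parser.boxes


def suspicious_lines(boxes, factor=1.6):
    """Return boxes whose height exceeds `factor` times the median height.

    The factor is a hypothetical starting point; tune it per book.
    """
    heights = [y1 - y0 for _, y0, _, y1 in boxes]
    if not heights:
        return []
    typical = median(heights)
    return [box for box, h in zip(boxes, heights) if h > factor * typical]
```

The flagged boxes are only candidates for manual review. A drop-cap line 
and a genuinely merged multi-line crop both show up as tall outliers, so 
this narrows down what to check rather than fixing anything by itself.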

I have never done anything like this, so I'm sure I made some mistakes, 
since the OCR quality barely improved and the error rate won't go below 
0.3-0.5%. The images are scans of old books from the 1800s in TIF format 
at 231 DPI, grayscale, dual pages. Some of them are slightly skewed, which 
I tried to correct with many different methods, and there's always a 
drawback. Mind you, sometimes only part of the text is skewed, and that is 
mixed with page skew, so it is pretty difficult to batch-process with 
software. I also tried textcleaner as well as manual ImageMagick tools for 
binarization and resampling to 400-600 DPI, with the resampling being one 
of the most useful things I tried. OTSU (auto-threshold) degrades the 
image quality too much: the font loses its sharpness and parts of the 
letters disappear. Kapur is much better, but it's inconsistent and also 
slightly loses some font precision, and images that have darker spots turn 
basically all black with Kapur.
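
For context on the thresholding results above: Otsu's method picks one 
global cut-off for the whole image by maximizing the between-class 
variance of the grayscale histogram. The sketch below is a plain-Python 
illustration of that idea, assuming 8-bit grayscale values in a flat list. 
It is not ImageMagick's implementation, and the function names are made up 
for this example.

```python
def otsu_threshold(pixels):
    """Return the 8-bit threshold that maximizes between-class variance.

    `pixels` is a flat list of integers in 0..255. Values <= threshold are
    treated as the dark class, values above it as the light class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0      # number of pixels in the dark class so far
    sum_dark = 0    # sum of their gray values
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue            # no dark pixels yet
        w_light = total - w_dark
        if w_light == 0:
            break               # everything is already in the dark class
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        between_var = w_dark * w_light * (mean_dark - mean_light) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize(pixels, threshold):
    """Map each pixel to 0 (dark) or 255 (light) using one global threshold."""
    return [255 if p > threshold else 0 for p in pixels]
```

Because a single value is applied to the entire page, a scan with a dark 
patch or uneven illumination can lose thin strokes in one area or fill in 
another, whichever global method chooses the value. Thresholding computed 
per region is the usual alternative to try in that situation.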

Filip

On Wednesday, August 5, 2020 at 6:12:32 PM UTC+2 [email protected] wrote:

> The technical term for these is "drop-caps 
> <https://en.wikipedia.org/wiki/Initial>," which is useful to know if you 
> want to Google for it.
>
> It's pretty dated now, but Ray's 2007 description 
> <https://tesseract-ocr.github.io/docs/tesseracticdar2007.pdf> of the line 
> finding algorithm says: "Assuming that page layout analysis has already 
> provided text regions of a roughly uniform text size, a simple percentile 
> height filter *removes drop-caps* and vertically touching characters." 
> [Emphasis added]
>
> It looks like the commercial package Omnipage supports drop caps. Teaching 
> Tesseract to recognize them would involve tweaking the internal 
> segmentation and line finding algorithms, not additional training. Another 
> approach would be to do your own segmentation to identify them and 
> recognize them separately as single letters.
>
> There's some general background which may be interesting/useful here: 
> https://how-ocr-works.com/OCR/line-segmentation.html
>
> Tom
>
>
> On Wednesday, August 5, 2020 at 4:58:20 AM UTC-4 [email protected] wrote:
>
>> That's right, it's that initial "TO", and this is just a fraction of the 
>> text; there are dozens of examples like "TO" on a single page. But since 
>> it spreads across two lines, there's nothing I can do, I assume?
>>
>> On Tuesday, August 4, 2020 at 7:39:21 PM UTC+2 zdenop wrote:
>>
>>> Not sure what you mean...
>>>
>>> tesseract big_low.jpeg - --psm 6
>>> Warning: Invalid resolution 0 dpi. Using 70 instead.
>>> FY, MINERS.—TO LET, ON LEASE, on such terms as may
>>> be agreed on, the MINERALS in the ESTATE of KNOCKSHINNOCK, lying in
>>> the parish of New Cumnock, and county of Ayr. Acdead vein has been 
>>> lately discovered
>>>
>>> The only problem is with the initial TO, which IMO is caused by the T 
>>> being two lines tall, followed by smaller letters.
>>>
>>> Zdenko
>>>
>>>
>>> On Tue, Aug 4, 2020 at 13:07 [email protected] <[email protected]> wrote:
>>>
>>>> Hello,
>>>>
>>>> Is it possible to train for the bigger fonts at the beginning of 
>>>> sentences? It seems that Tesseract always misses them.
>>>>
>>>> Thanks in advance.
>>>>
>>>> -- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to [email protected].
>>>> To view this discussion on the web visit 
>>>> https://groups.google.com/d/msgid/tesseract-ocr/0f97a784-e8e4-4c05-8296-b95dc2211e78n%40googlegroups.com
>>>>
>>>
