OK, so I whipped up a program that uses Pango to get character 
metrics for a given font, of the sort that are useful for 
Tesseract's unicharset file.

It takes a file of UTF-8 characters, one per line, and a font 
description (in the same format as you provide to text2image; 
Pango's "font description" format). For each character it outputs 
the character followed by the bottom, top, width, bearing, and 
advance values, roughly calibrated to the co-ordinate system 
Tesseract uses.

This could be the basis for a tool that takes all the different 
fonts used and gets the minimums and maximums for each value, but 
first we should compare it to the sorts of values in the official
unicharset files to look for discrepancies.
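
As a rough idea of what that min/max step could look like, here is a
minimal sketch. It only assumes the output format described above
(character, then five integers); the names metric_range,
parse_metrics_line and range_update are made up for illustration and
are not part of the attached program.

```c
#include <stdio.h>
#include <string.h>

#define NMETRICS 5      /* bottom, top, width, bearing, advance */
#define MAXCHARBYTES 24 /* same limit as the attached program */

/* Running minimum and maximum of each metric for one character
 * (hypothetical helper type, not part of charmetrics.c). */
struct metric_range {
	int min[NMETRICS];
	int max[NMETRICS];
	int seen; /* 0 until the first set of values has been folded in */
};

/* Parse one line of charmetrics output, e.g. "I 63 192 70 3 73".
 * ch must have room for MAXCHARBYTES bytes.
 * Returns 1 on success, 0 if the line is malformed. */
int parse_metrics_line(const char *line, char *ch, int vals[NMETRICS]) {
	return sscanf(line, "%23s %d %d %d %d %d", ch,
	              &vals[0], &vals[1], &vals[2], &vals[3], &vals[4]) == 6;
}

/* Fold one font's values for a character into its running range. */
void range_update(struct metric_range *r, const int vals[NMETRICS]) {
	int i;

	for(i = 0; i < NMETRICS; i++) {
		if(!r->seen || vals[i] < r->min[i]) r->min[i] = vals[i];
		if(!r->seen || vals[i] > r->max[i]) r->max[i] = vals[i];
	}
	r->seen = 1;
}
```

The idea would be to run charmetrics once per font, keep one
metric_range per character, and call range_update for every line
that parses.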

It is very, very provisional. The output looks sensible from light 
testing, but it's intended more as a base for further testing and 
questioning than as a finished tool. Oh, and there will be bugs, and 
you can probably crash it. It also gives you no indication of 
whether the requested font was actually loaded, so if Pango falls 
back to another font you won't be told. Again, it's a proof of 
concept; something to work with.

Attached is the code, plus the chars file for eng to play around 
with.

Example runs:
./charmetrics eng.unicharset.chars 'Linux Libertine' | head -n 3
I 63 192 70 3 73
' 173 192 22 14 36
v 61 137 126 1 127

./charmetrics eng.unicharset.chars 'DejaVu Sans' | head -n 3
I 64 205 25 25 50
' 182 205 21 25 46
v 64 158 136 8 144
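
One thing worth noticing when comparing against the official
unicharset values: in every line above the last column equals width
plus bearing (70 + 3 = 73, 25 + 25 = 50, and so on). That follows
from the code printing PANGO_RBEARING of the ink rectangle, which is
x + width, so the "advance" column is really the right ink edge
rather than a pen advance. A small sketch of a check for this,
assuming only the output format shown (advance_is_ink_edge is a
made-up name):

```c
#include <stdio.h>

/* Check one line of charmetrics output, e.g. "I 63 192 70 3 73".
 * Returns 1 if the last column equals width + bearing, 0 if it does
 * not, and -1 if the line cannot be parsed. */
int advance_is_ink_edge(const char *line) {
	char ch[24];
	int bottom, top, width, bearing, advance;

	if(sscanf(line, "%23s %d %d %d %d %d", ch,
	          &bottom, &top, &width, &bearing, &advance) != 6) {
		return -1;
	}
	return advance == width + bearing;
}
```

A line where this returns 0 would most likely be a glyph with a
negative left bearing, since the program clamps the bearing column
to zero.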

I
'
v
e
J
o
i
n
|
-
S
z
:
#
6
%
5
0
@
p
a
r
m
F
u
s
B
»
f
d
c
h
C
t
L
?
T
M
y
R
l
~
<
®
N
b
k
[
«
1
,
.
”
g
H
$
(
+
D
w
V
£
4
9
Q
&
A
P
¢
]
3
2
©
8
/
>
X
é
j
;
7
€
O
¥
U
x
}
E
§
=
!
’
G
)
Z
q
{
“
—
Y
K
*
W
"
\
°
fi
‘
_
fl
/*
 * Copyright 2014 Nick White <[email protected]>
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Build with something like this:
 * cc `pkg-config --cflags --libs pangocairo` charmetrics.c -o charmetrics
 */

#define usage "charmetrics - calculates some metrics useful for a unicharset file\n" \
              "usage: charmetrics chars.txt fontname\n"

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <pango/pangocairo.h>

#define FONTSIZE 256 /* Yields an appropriate sized character for the 256x256 square */
#define BASELINE_NORMALISE 64

#define MAXCHARBYTES 24 /* Tesseract has this limit, IIRC */
#define MAXCHARS 16384  /* Chosen arbitrarily as "big enough" */
#define MINZERO(x) ((x) > 0 ? (x) : 0)

int main(int argc, char *argv[]) {
	char c[MAXCHARBYTES];
	char chs[MAXCHARS][MAXCHARBYTES];
	unsigned int chnum, i;
	FILE *f;
	int baseline;
	PangoFontDescription *font_description;
	PangoRectangle rect;
	cairo_surface_t *surface;
	cairo_t *cr;
	PangoLayout *layout;

	if(argc != 3) {
		fputs(usage, stderr); /* wrong arguments is an error, so use stderr */
		return 1;
	}

	if((f = fopen(argv[1], "r")) == NULL) {
		fprintf(stderr, "Can't open char file: %s\n", argv[1]);
		return 1;
	}
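	chnum = 0;
	while(fgets(c, MAXCHARBYTES, f) != NULL) {
		/* strip the line ending, if any; the last line may lack one */
		c[strcspn(c, "\r\n")] = '\0';
		if(c[0] == '\0') {
			continue; /* skip blank lines */
		}
		if(chnum < MAXCHARS) {
			strncpy(chs[chnum], c, MAXCHARBYTES);
			chnum++;
		}
	}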
	fclose(f);

	font_description = pango_font_description_from_string(argv[2]);
	pango_font_description_set_absolute_size(font_description, FONTSIZE * PANGO_SCALE);

	surface = cairo_image_surface_create(CAIRO_FORMAT_ARGB32, 0, 0);
	cr = cairo_create(surface);
	layout = pango_cairo_create_layout(cr);
	pango_layout_set_font_description(layout, font_description);
	pango_font_description_free(font_description);

	baseline = (pango_layout_get_baseline(layout) / PANGO_SCALE) + BASELINE_NORMALISE;

	for(i = 0; i < chnum; i++) {
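		pango_layout_set_text(layout, chs[i], -1);
		/* second argument is the ink rectangle; logical extents are not requested */
		pango_layout_get_pixel_extents(layout, &rect, NULL);

		/* negative values are clamped to zero by MINZERO */
		printf("%s %d %d %d %d %d\n",
		       chs[i],
		       MINZERO(baseline - (rect.y + rect.height)),  /* bottom */
		       MINZERO(256 - rect.y),                       /* top */
		       rect.width,                                  /* width (ink) */
		       MINZERO(PANGO_LBEARING(rect)),               /* bearing: ink x offset */
		       MINZERO(PANGO_RBEARING(rect))                /* "advance": right ink edge, x + width */
		      );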
	}

	g_object_unref(layout);
	cairo_destroy(cr);
	cairo_surface_destroy(surface);

	return 0;
}
