Hi Holger
If there are plenty of s's and t's on the page, it is no problem to skip
a couple. Another possible strategy is to create one box for both letters
('st'), similar to what you've probably done for the 'ch' ligature.
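
To illustrate the combined-box idea, here is a minimal sketch that merges two
neighbouring box-file entries into one. It assumes the Tesseract 3.x box-file
layout (glyph, left, bottom, right, top, page, with the origin at the
bottom-left of the image); the function names and coordinates are made up
for illustration.

```python
def parse_box_line(line):
    """Parse 'glyph left bottom right top [page]' into a tuple."""
    parts = line.split()
    glyph = parts[0]
    left, bottom, right, top = (int(p) for p in parts[1:5])
    # Older box files may lack the page column; assume page 0 in that case.
    page = int(parts[5]) if len(parts) > 5 else 0
    return glyph, left, bottom, right, top, page


def merge_boxes(a, b):
    """Return one box that covers both a and b, labelled with both glyphs."""
    if a[5] != b[5]:
        raise ValueError("boxes are on different pages")
    return (
        a[0] + b[0],       # combined label, e.g. 's' + 't' -> 'st'
        min(a[1], b[1]),   # left
        min(a[2], b[2]),   # bottom
        max(a[3], b[3]),   # right
        max(a[4], b[4]),   # top
        a[5],              # page
    )


def format_box(box):
    """Turn a box tuple back into a box-file line."""
    return "%s %d %d %d %d %d" % box


# Hypothetical coordinates for an 's' that touches the following 't'.
s_box = parse_box_line("s 100 200 112 230 0")
t_box = parse_box_line("t 110 200 125 235 0")
print(format_box(merge_boxes(s_box, t_box)))
```

The merged line would replace the two original lines in the .box file.
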
From a quick look at your image, I suspect you might get problems with
overlapping boxes when you try to use it for training ('lt' in "enthalten"
looks problematic). But try running tesseract with the box.train command
and see what happens.
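
If you want to look for overlaps before training, something like the
following could help. It is only a sketch, assuming the Tesseract 3.x
box-file layout (glyph, left, bottom, right, top, page); the function names
and the sample coordinates are invented for the example.

```python
from itertools import combinations


def parse_box_line(line):
    """Parse 'glyph left bottom right top [page]' into a tuple."""
    parts = line.split()
    left, bottom, right, top = (int(p) for p in parts[1:5])
    # Assume page 0 when the page column is missing.
    page = int(parts[5]) if len(parts) > 5 else 0
    return parts[0], left, bottom, right, top, page


def overlaps(a, b):
    """True if both boxes are on the same page and their rectangles share area."""
    return (
        a[5] == b[5]
        and a[1] < b[3] and b[1] < a[3]   # horizontal extents intersect
        and a[2] < b[4] and b[2] < a[4]   # vertical extents intersect
    )


def find_overlaps(lines):
    """Return every pair of boxes that overlap, in file order."""
    boxes = [parse_box_line(line) for line in lines if line.strip()]
    return [(a, b) for a, b in combinations(boxes, 2) if overlaps(a, b)]


# Hypothetical boxes: 'l' and 't' overlap by two pixels, 'e' stands alone.
sample = ["l 10 10 20 40 0", "t 18 10 30 38 0", "e 40 10 50 30 0"]
for a, b in find_overlaps(sample):
    print("overlap:", a[0], b[0])
```

To check a real file, pass the lines of the .box file (opened as UTF-8) to
find_overlaps instead of the sample list. The pairwise comparison is
quadratic, which should be fine for a single page.
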
I've always used as many sets of .tif/.box files as I had available. For
the version of deu-frak available in the downloads section, I think it was
about 25. For the newest version from
https://github.com/paalberti/tesseract-dan-fraktur/tree/master/deu-frak
it is 32, I think.
Best regards,
Peter
On 15 May, 23:56, stinguin <[email protected]> wrote:
> Hi all,
>
> I was diligent and built a new wordlist and some new box files. Can
> you take a look at my boxes before I use them to create a new
> traineddata? Because there are different fonts and because some
> letters are too close to separate (e.g. 's' next to 't'), I
> couldn't make a box for each letter, as you can see here:
>
> http://s1.directupload.net/file/d/2525/kheno9zf_jpg.htm
>
> Is it bad that I "ignore" some characters of the original page or is
> it OK? Would it be better to use a bitonal scan? And what's better,
> slim boxes or boxes with some space around the letters?
>
> Many thanks in advance (as usual;-)
>
> Holger
>
> @Peter: Can you tell me how many box files the official deu-frak
> language consists of? Only the 8 deu-frak ones?
>