If I understood you correctly, you would like to have something like this:

tesseract lm-110.jpg lm-110 -l fra makebox


which creates a box file, plus some tool that replaces the symbol (text)
part of the box file with the content of, e.g., lm-110.txt (the certified
text)? I did this with QBE [1], but QBE has some limitations:

   - there must be exactly one symbol per box
   - the number of boxes must equal the number of symbols in your text file
   (not counting spaces)
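
A quick way to check the second limitation before importing is to count
both sides. This is only a sketch: the function names are made up for
illustration, and it assumes the usual Tesseract 3.x box format with one
box per line. The NFC normalization is my own addition, so that an accented
letter stored as letter + combining accent is counted as one symbol.

```python
import unicodedata


def count_boxes(box_lines):
    """Count non-empty lines; each such line is assumed to describe one box."""
    return sum(1 for line in box_lines if line.strip())


def count_symbols(text):
    """Count characters in the text, ignoring spaces and line breaks."""
    # Assumption: compose accents first so "e" + combining accent counts once.
    normalized = unicodedata.normalize("NFC", text)
    return sum(1 for ch in normalized if not ch.isspace())


def check_counts(box_lines, text):
    """Return (number of boxes, number of symbols, whether they match)."""
    boxes = count_boxes(box_lines)
    symbols = count_symbols(text)
    return boxes, symbols, boxes == symbols
```

If the two numbers differ, there are still boxes to remove or split before
the text can be imported.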

So my workflow was something like this:

   1. create the box file (or open the image in QBE, which will offer to
   create the box file)
   2. remove unnecessary boxes (headings, footers, page numbers, scan
   artifacts...)
   3. split multi-symbol boxes (i.e. boxes that contain more than one
   symbol)
   4. import the text from an external file (QBE->File->Import...->Import
   text file)

It still needs user interaction (it is not automatic), but it may help if
you need something like that.
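
For illustration, step 4 can also be approximated outside QBE with a few
lines of Python. This is a rough sketch and not how QBE does it internally:
replace_symbols is a hypothetical name, the box lines are assumed to be in
the Tesseract 3.x form "symbol left bottom right top page", and steps 2 and
3 must already be done so that boxes and symbols line up one to one.

```python
import unicodedata


def replace_symbols(box_lines, text):
    """Return new box lines whose symbols are taken, in order, from text."""
    # Assumption: compose accents (NFC) and drop all whitespace, so the
    # remaining characters correspond to the boxes in reading order.
    normalized = unicodedata.normalize("NFC", text)
    symbols = [ch for ch in normalized if not ch.isspace()]
    boxes = [line.rstrip("\n") for line in box_lines if line.strip()]
    if len(boxes) != len(symbols):
        raise ValueError(
            "%d boxes but %d symbols" % (len(boxes), len(symbols))
        )
    result = []
    for line, symbol in zip(boxes, symbols):
        # The first field is the old symbol; everything after the first
        # space (the coordinates and page number) is kept unchanged.
        _old_symbol, coordinates = line.split(" ", 1)
        result.append(symbol + " " + coordinates)
    return result
```

To use it, read the lines of lm-110.box and the content of lm-110.txt (both
as UTF-8), pass them to the function, and write the returned lines back to
the box file. The ValueError tells you when the counts do not match, which
is the same condition under which the QBE import cannot work.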

[1] https://github.com/zdenop/qt-box-editor

Zdenko


On Sat, Mar 8, 2014 at 7:47 PM, Bernard Polarski <[email protected]> wrote:

> Let me summarize what I am doing and what I am trying to achieve.
>
> Tesseract is excellent when it comes to recognizing binary fonts
> (fonts that come from a computer, printed or directly generated from an
> application).
>
> The match is near perfect, and many times it is perfect.
> And it is now easy to train on a text for a zillion fonts when it comes
> to binary fonts:
>
>    text2image --text=$FIN  --outputbase=$FOUT  --fonts_dir=$FONT_DIR
> --render_per_font --find_fonts
>
> This will generate a zillion fonts. This is a big plus of version
> 3.03. But honestly this job has been done at Google.
>
> But training on binary fonts gives disappointing results when applied to
> printed fonts, especially for books from the 19th century.
> I belong to a group that edits EPUBs of 19th-century books.
> Such books come in collections, and the collections were often
> printed on the same machine.
>
> So instead of creating a library of 'Century old school' fonts, I am
> exploring the idea of creating a font dedicated to a publisher for a given
> period, e.g. 'EFlammarion1870.ttf', to be used on these books.
>
> I have plenty of scripts to automatically generate a traineddata
> file, starting from a directory containing img.tif files and their img.box
> files. But it is very time consuming to generate every one of these box
> files.
>
> The idea is to start from a set of scanned images and grab a certified
> text from a site like Gutenberg (for French, ebooksgratuits.com provides
> more books).
> A string search on the first 3 words of the page in the certified text
> locates the needed certified transcription.
>
> So I am now looking for a method to transform the certified text
> into a box file, doing this for a few pages in order to quickly generate
> a new traineddata and test it.
> In this respect, JTessBoxEditor is very good, but the process of
> generating the box file with it is too slow and prone to errors.
>
>
> Here is a page extracted from "La maison nucingen" whose print is quite
> bad, so it is interesting.
>
> http://gallica.bnf.fr/ark:/12148/bpt6k58135211/f107.image.r=la%20maison%20nucingen.langEN
>
> <https://lh4.googleusercontent.com/-7xPLX_2HR54/UxtWUEx8nBI/AAAAAAAAAB4/ro0vwKP0Oh4/s1600/lm-110.tif>
>
>
> The text :
> proposait d’opérer avec ses millions faits d’une
> main de papier rose à l’aide d’une pierre litho-
> graphique, de jolies petites actions à placer, pré-
> cieusement conservées dans son cabinet. Les ac-
> tions réelles allaient servir à fonder l’affaire,
> acheter un magnifique hôtel et commencer les
> opérations. Nucingen se trouvait encore des ac-
> tions dans je ne sais quelles mines de plomb ar-
> gentifère, dans des mines de houille et dans deux
> canaux, actions bénéficiaires accordées pour la
> mise en scène de ces quatre entreprises en pleine
> activité, supérieurement montées et en faveur, au
> moyen du dividende pris sur le capital. Nucin-
> gen pouvait compter sur un agio si les actions
> montaient, mais le baron le négligea dans ses
> calculs, il le laissait à fleur d’eau, sur la place,
> afin d’attirer les poissons ! Il avait donc massé
> ses valeurs, comme Napoléon massait ses trou-
> piers, afin de liquider durant la crise qui se des-
> sinait et qui révolutionna, en 26 et 27 les places
> européennes. S’il avait eu son prince de Wagram,
> il aurait pu dire comme Napoléon du haut du
> Santon : « Examinez bien la place, tel jour, à telle
> heure, il y aura là des fonds répandus ! » Mais à
> qui pouvait-il se confier ? Du Tillet ne soupçonna
>
>
>
>
> --
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to [email protected]
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en
>
> ---
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/d/optout.
>
