Bernard,

What do you mean by "assert a text box of 200 words"? Can you elaborate? 
Thanks.

Quan

On Monday, March 10, 2014 11:06:18 AM UTC-5, Bernard Polarski wrote:
>
>
> Since I have the source, I will recompile it this evening at home and will 
> let you know.
> It takes an average of 30 min to assert a text box of 200 words using 
> jTessBoxEditor. 
> This is a real issue.
>  
> On Monday, March 10, 2014 13:31:39 UTC+1, zdenop wrote:
>
>> I have not run QBE on Windows for a long time. 
>> Try this (QBE plus dependencies)[1]. I run it on Win7 Pro 64-bit (even 
>> though the app and libs are 32-bit, built with MinGW 4.8, leptonica 1.70 
>> and tesseract 3.03rc1). 
>>
>> [1] http://www.sk-spell.sk.cx/tmp/qtb-1.11.1.ZIP
>>
>> Zdenko
>>
>>
>> On Mon, Mar 10, 2014 at 7:21 AM, Bernard Polarski <[email protected]> wrote:
>>
>>> I downloaded QBE and the additional libraries, but it does not start on 
>>> my Windows 7 machine. I just get the message that the application has 
>>> stopped working and Windows has to close it. 
>>>
>>>
>>> On Sunday, March 9, 2014 21:19:23 UTC+1, zdenop wrote: 
>>>>
>>>>  If I understood you correctly, you would like to have something like 
>>>> this: 
>>>>
>>>>  tesseract lm-110.jpg lm-110 -l fra makebox
>>>>
>>>>
>>>> that creates a box file, and then some tool that will replace the symbol 
>>>> (text) part of the box file with the content of e.g. lm-110.txt (the 
>>>> certified text)? I did this with QBE[1], but there are some (QBE) 
>>>> limitations:
>>>>  
>>>>    - there must be exactly one symbol per box
>>>>    - the number of boxes must equal the number of symbols in your text 
>>>>    file (not counting spaces)
>>>>
>>>>  So my workflow was something like this:
>>>>  
>>>>    1. create a box file (or open the image in QBE, which will offer to 
>>>>    create one)
>>>>    2. remove unnecessary boxes (headings, footers, page numbers, scan 
>>>>    artifacts...) 
>>>>    3. split multi-symbol boxes (i.e. boxes that contain more than one 
>>>>    symbol) 
>>>>    4. import text from external file (QBE->File->Import...->Import 
>>>>    text file)
>>>>
>>>> It still needs user interaction (it is not automatic), but it can help 
>>>> if you need something like that.
>>>>
>>>> [1] https://github.com/zdenop/qt-box-editor
>>>>  
>>>>  Zdenko
>>>>
>>>>
>>>>  On Sat, Mar 8, 2014 at 7:47 PM, Bernard Polarski <[email protected]> wrote:
>>>>
>>>>>   Let me summarize what I am doing and what I am trying to achieve.
>>>>>
>>>>> Tesseract is excellent when it comes to recognizing binary fonts 
>>>>> (fonts that come from a computer, whether printed or generated 
>>>>> directly by an application). 
>>>>>
>>>>> The match is near perfect, and many times it is perfect. 
>>>>> And it is now easy to train on a text in a zillion fonts when it comes 
>>>>> to binary fonts:
>>>>>
>>>>>    text2image --text=$FIN  --outputbase=$FOUT  --fonts_dir=$FONT_DIR 
>>>>> --render_per_font --find_fonts
>>>>>
>>>>> This will generate output for a zillion fonts. This is a big plus of 
>>>>> version 3.03. But honestly, this job has already been done at Google.
>>>>>
>>>>> But training data built from binary fonts is disappointing when it is 
>>>>> applied to printed fonts, especially for books from the 19th century.
>>>>> I belong to a group that edits EPUBs of 19th-century books.
>>>>> Such books come in collections, and a collection was often printed on 
>>>>> the same machine.
>>>>>
>>>>> So instead of creating a library of 'Century old school' fonts, I am 
>>>>> exploring the idea of creating a font dedicated to one publisher for a 
>>>>> given period, 
>>>>> e.g. 'EFlammarion1870.ttf', to be used on these books.
>>>>>
>>>>> I have plenty of scripts to automatically generate a traineddata 
>>>>> file, starting from a directory containing img.tif files and their 
>>>>> img.box files.
>>>>> But it is very time-consuming to generate every one of these box files.
>>>>>
>>>>> The idea is to start from a set of scanned images and grab a certified 
>>>>> text from a site like Gutenberg (for French, ebooksgratuits.com provides 
>>>>> more books).
>>>>> A string search for the page's first 3 words in the certified text, 
>>>>> and there is the needed certified transcription.
>>>>>
>>>>> So I am now looking for a method to transform the certified text into 
>>>>> a box file, 
>>>>> doing this for a few pages in order to quickly generate a new 
>>>>> traineddata and test it.
>>>>> In this respect, jTessBoxEditor is very good, but its process for 
>>>>> generating the box file is clearly too slow and error-prone.
>>>>>
>>>>>
>>>>> Here is a page extracted from "La Maison Nucingen" whose print is 
>>>>> quite bad, so it is interesting.
>>>>>
>>>>> http://gallica.bnf.fr/ark:/12148/bpt6k58135211/f107.image.r=la%20maison%20nucingen.langEN
>>>>>
>>>>> <https://lh4.googleusercontent.com/-7xPLX_2HR54/UxtWUEx8nBI/AAAAAAAAAB4/ro0vwKP0Oh4/s1600/lm-110.tif>
>>>>>
>>>>>
>>>>> The text:
>>>>> proposait d’opérer avec ses millions faits d’une
>>>>> main de papier rose à l’aide d’une pierre litho-
>>>>> graphique, de jolies petites actions à placer, pré-
>>>>> cieusement conservées dans son cabinet. Les ac-
>>>>> tions réelles allaient servir à fonder l’affaire,
>>>>> acheter un magnifique hôtel et commencer les
>>>>> opérations. Nucingen se trouvait encore des ac-
>>>>> tions dans je ne sais quelles mines de plomb ar-
>>>>> gentifère, dans des mines de houille et dans deux
>>>>> canaux, actions bénéficiaires accordées pour la 
>>>>> mise en scène de ces quatre entreprises en pleine
>>>>> activité, supérieurement montées et en faveur, au
>>>>> moyen du dividende pris sur le capital. Nucin-
>>>>> gen pouvait compter sur un agio si les actions 
>>>>> montaient, mais le baron le négligea dans ses 
>>>>> calculs, il le laissait à fleur d’eau, sur la place, 
>>>>> afin d’attirer les poissons ! Il avait donc massé 
>>>>> ses valeurs, comme Napoléon massait ses trou-
>>>>> piers, afin de liquider durant la crise qui se des-
>>>>> sinait et qui révolutionna, en 26 et 27 les places 
>>>>> européennes. S’il avait eu son prince de Wagram, 
>>>>> il aurait pu dire comme Napoléon du haut du 
>>>>> Santon : « Examinez bien la place, tel jour, à telle 
>>>>> heure, il y aura là des fonds répandus ! » Mais à 
>>>>> qui pouvait-il se confier ? Du Tillet ne soupçonna 
>>>>>
>>>>>
>>>>>  
>>>>>   
>>>>> -- 
>>>>> -- 
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To post to this group, send email to [email protected] 
>>>>>
>>>>> To unsubscribe from this group, send email to
>>>>> [email protected] 
>>>>>
>>>>> For more options, visit this group at
>>>>> http://groups.google.com/group/tesseract-ocr?hl=en
>>>>>
>>>>> --- 
>>>>> You received this message because you are subscribed to the Google 
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>>> an email to [email protected]. 
>>>>>
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>
>>>>
>>
>>

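For the box-file step Zdenko describes in the quoted message (keep the box coordinates, swap in the certified text one symbol per box), here is a minimal Python sketch. It is not how QBE does it. It assumes the Tesseract 3.x box-file layout of `symbol left bottom right top page` on each line, boxes already in reading order, and one symbol per box. The function name `replace_box_symbols` and the coordinates in the demo are made up for illustration.

```python
import unicodedata


def replace_box_symbols(box_lines, certified_text):
    """Return box lines whose symbols come from the certified text.

    Assumption: each box line is 'symbol left bottom right top page',
    boxes are in reading order, and there is one symbol per box.
    """
    # Compose accents (e + combining acute -> é) so one glyph is one symbol.
    text = unicodedata.normalize("NFC", certified_text)
    # Spaces and line breaks have no box of their own, so drop them.
    symbols = [ch for ch in text if not ch.isspace()]
    boxes = [line for line in box_lines if line.strip()]
    if len(boxes) != len(symbols):
        raise ValueError(
            f"{len(boxes)} boxes but {len(symbols)} symbols; "
            "remove or split boxes first"
        )
    result = []
    for line, symbol in zip(boxes, symbols):
        # The last five fields are left, bottom, right, top, page.
        fields = line.rsplit(None, 5)
        if len(fields) != 6:
            raise ValueError(f"unexpected box line: {line!r}")
        result.append(" ".join([symbol] + fields[1:]))
    return result


if __name__ == "__main__":
    # Made-up OCR output in which every symbol was misrecognized.
    ocr_boxes = ["p 36 92 50 116 0", "n 52 92 64 116 0", "o 66 92 78 116 0"]
    for line in replace_box_symbols(ocr_boxes, "pr o"):
        print(line)
```

The count check mirrors the QBE limitation quoted above: if the number of boxes and the number of non-space symbols differ, the sketch stops instead of guessing, so stray boxes still have to be removed or split by hand first. End-of-line hyphens in the certified text are kept on purpose, since they are printed on the page and have boxes too.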
