Re: Create boxfile from a certified text

Quan Nguyen Tue, 11 Mar 2014 18:46:34 -0700

You can use a regex to transform your reference text into one non-space 
character per line. Vertically select that column and copy it to the 
clipboard. Then, in the box file, vertically select the symbol column and 
replace it with the clipboard content.


That's the trick I normally use in editing box files. Good programming 
editors like jEdit or Notepad++ usually support vertical selection of text.
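
If your editor lacks vertical selection, the same column swap can be 
scripted. What follows is only a minimal Python sketch, not a feature of 
jTessBoxEditor or Tesseract: the function names reference_symbols and 
replace_box_symbols are hypothetical, and it assumes every box line has the 
form "symbol left bottom right top page" with exactly one symbol per box, 
in the same order as the non-space characters of the reference text.

```python
import re


def reference_symbols(text):
    """Return every non-whitespace character of the reference text, in order."""
    return re.findall(r"\S", text)


def replace_box_symbols(box_lines, symbols):
    """Swap the first field (the symbol) of each box line for the reference symbol.

    Assumes one symbol per box and the same reading order in both inputs.
    """
    boxes = [line.rstrip("\n") for line in box_lines if line.strip()]
    if len(boxes) != len(symbols):
        # A single extra or missing box would shift every later symbol.
        raise ValueError(
            f"{len(boxes)} boxes but {len(symbols)} symbols: "
            "split, merge, or delete boxes first"
        )
    result = []
    for line, symbol in zip(boxes, symbols):
        # Keep the coordinates and page number untouched.
        _old_symbol, coords = line.split(" ", 1)
        result.append(f"{symbol} {coords}")
    return result


if __name__ == "__main__":
    # Made-up coordinates, for illustration only.
    boxes = ["rn 10 20 30 40 0", "a 31 20 50 40 0"]
    for line in replace_box_symbols(boxes, reference_symbols("m a")):
        print(line)
```

The count check is deliberate: it stops before writing anything when the 
number of boxes and the number of symbols differ, which is the situation 
where boxes still need to be split, merged, or deleted by hand.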

On Tuesday, March 11, 2014 2:03:55 AM UTC-5, Bernard Polarski wrote:
>
> I just mean asserting that the text is an exact match of the image. You 
> have to check every box file and possibly split, merge, or delete some 
> boxes. Once that is done, I still compare the result using this simple 
> command: cat <file> | cut -c 1 | tr '\n' ' '.
> Then I read every word again until I am satisfied that the box file is 
> absolutely correct. I then store the image and the box file in a directory 
> to be used when I want to create a traineddata file. I am creating various 
> directories for various types of font. But since version 3.03, traineddata 
> created from scanned images has less impact for me. It does have an 
> effect, but I get more negative impacts than good ones. I am fighting hard 
> to isolate one single effect. For the moment the best results are obtained 
> by cleaning seldom-used short words (2 letters) out of the FRA dictionary. 
> Now I feel the need to set up regression tests over 20 certified box/text 
> pairs in order to measure the impact of one single change.
>
> This is work in progress, and ABBYY is already off, but I hope to make more 
> progress before submitting to my group.
>
> On Tuesday, March 11, 2014 00:08:34 UTC+1, Quan Nguyen wrote:
>>
>> Bernard,
>>
>> What do you mean by "assert a text box of 200 words"? Can you elaborate? 
>> Thanks.
>>
>> Quan
>>
>> On Monday, March 10, 2014 11:06:18 AM UTC-5, Bernard Polarski wrote:
>>>
>>>
>>> Since I have the source, I will recompile it this evening at home and 
>>> will let you know.
>>> It takes an average of 30 min to assert a text box of 200 words using 
>>> jTessBoxEditor. 
>>> This is a real issue.
>>>  
>>> On Monday, March 10, 2014 13:31:39 UTC+1, zdenop wrote:
>>>
>>>> I have not run QBE on Windows for a long time. 
>>>> Try this (QBE + dependencies)[1]. I ran it on Win7 Pro 64-bit (even though 
>>>> the app and libs are 32-bit, built with MinGW 4.8, Leptonica 1.70, and 
>>>> Tesseract 3.03rc1).
>>>>
>>>> [1] http://www.sk-spell.sk.cx/tmp/qtb-1.11.1.ZIP
>>>>
>>>> Zdenko
>>>>
>>>>
>>>> On Mon, Mar 10, 2014 at 7:21 AM, Bernard Polarski <[email protected]> wrote:
>>>>
>>>>> I downloaded QBE and the additional libraries, but it does not start 
>>>>> on my Windows 7. I just get the message that the application has stopped 
>>>>> working and Windows has to close it. 
>>>>>
>>>>>
>>>>> On Sunday, March 9, 2014 21:19:23 UTC+1, zdenop wrote: 
>>>>>>
>>>>>>  If I understood you correctly, you would like to have something 
>>>>>> like this: 
>>>>>>
>>>>>>  tesseract lm-110.jpg lm-110 -l fra makebox
>>>>>>
>>>>>>
>>>>>> which creates a box file, and then some tool that replaces the 
>>>>>> symbol (text) part of the box file with the content of e.g. lm-110.txt 
>>>>>> (the certified text)? I did this with QBE[1]. But there are some (QBE) 
>>>>>> limitations:
>>>>>>  
>>>>>>    - there must be one symbol per box  
>>>>>>    - the number of boxes must equal the number of symbols in your 
>>>>>>    text file (not counting spaces)
>>>>>>
>>>>>>  So my workflow was something like this:
>>>>>>  
>>>>>>    1. create box file (or open image in QBE - it will offer you to 
>>>>>>    create box file)
>>>>>>    2. remove unnecessary boxes (headers, footers, page numbers, scan 
>>>>>>    artifacts...) 
>>>>>>    3. split multi-symbol boxes (e.g. where one box contained more than 
>>>>>>    one symbol) 
>>>>>>    4. import text from external file (QBE->File->Import...->Import 
>>>>>>    text file)
>>>>>>
>>>>>> It still needs user interaction (it is not automatic), but it can help 
>>>>>> if you need something like that.
>>>>>>
>>>>>> [1] https://github.com/zdenop/qt-box-editor
>>>>>>  
>>>>>>  Zdenko
>>>>>>
>>>>>>
>>>>>>  On Sat, Mar 8, 2014 at 7:47 PM, Bernard Polarski 
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>>   Let me summarize what I am doing and what I am trying to achieve.
>>>>>>>
>>>>>>> Tesseract is excellent when it comes to recognizing binary fonts 
>>>>>>> (fonts that come from a computer, whether printed or generated 
>>>>>>> directly by an application). 
>>>>>>>
>>>>>>> The match is near perfect, and many times it is perfect. 
>>>>>>> And it is now easy to train a text for a zillion fonts when it 
>>>>>>> comes to binary fonts:
>>>>>>>
>>>>>>>    text2image --text=$FIN  --outputbase=$FOUT  --fonts_dir=$FONT_DIR 
>>>>>>> --render_per_font --find_fonts
>>>>>>>
>>>>>>> This will generate a zillion fonts. This is a big plus of 
>>>>>>> version 3.03. But honestly, this job has already been done at Google.
>>>>>>>
>>>>>>> But training from binary fonts gives disappointing results when 
>>>>>>> applied to printed fonts, especially for books from the 19th century.
>>>>>>> I belong to a group that produces EPUB editions of 19th-century books.
>>>>>>> Such books come in collections, and the collections were 
>>>>>>> often printed on the same machine.
>>>>>>>
>>>>>>> So instead of creating a library of 'Century old school' fonts, I am 
>>>>>>> exploring the idea of creating a font dedicated to a publisher for a 
>>>>>>> given period, 
>>>>>>> i.e. 'EFlammarion1870.ttf' to be used on these books.
>>>>>>>
>>>>>>> I have plenty of scripts to automatically generate a 
>>>>>>> traineddata file, starting from a directory containing img.tif files and 
>>>>>>> their img.box files.
>>>>>>> But it is very time-consuming to generate every one of these box 
>>>>>>> files.
>>>>>>>
>>>>>>> The idea is to start from a set of scanned images and grab a certified 
>>>>>>> text 
>>>>>>> from a site like Gutenberg (for French, ebooksgratuits.com provides 
>>>>>>> more books).
>>>>>>> Searching the certified text for the page's first 3 words locates 
>>>>>>> the needed certified transcription.
>>>>>>>
>>>>>>> So I am now looking for a method to transform the certified 
>>>>>>> text into a box file, 
>>>>>>> doing this for a few pages in order to quickly generate a new 
>>>>>>> traineddata file and test it.
>>>>>>> In this respect, it is clear that jTessBoxEditor is very 
>>>>>>> good, but its process 
>>>>>>> for generating the box file is too slow and prone to errors.
>>>>>>>
>>>>>>>
>>>>>>>  Here is a page extracted from "La Maison Nucingen" whose print is 
>>>>>>>> quite bad, so it is interesting.
>>>>>>>>
>>>>>>>  
>>>>>>>
>>>>>>>> http://gallica.bnf.fr/ark:/12148/bpt6k58135211/f107.image.r=la%20maison%20nucingen.langEN
>>>>>>>>
>>>>>>>  
>>>>>>>
>>>>>>>
>>>>>>> <https://lh4.googleusercontent.com/-7xPLX_2HR54/UxtWUEx8nBI/AAAAAAAAAB4/ro0vwKP0Oh4/s1600/lm-110.tif>
>>>>>>>
>>>>>>>
>>>>>>> The text :
>>>>>>> proposait d’opérer avec ses millions faits d’une
>>>>>>> main de papier rose à l’aide d’une pierre litho-
>>>>>>> graphique, de jolies petites actions à placer, pré-
>>>>>>> cieusement conservées dans son cabinet. Les ac-
>>>>>>> tions réelles allaient servir à fonder l’affaire,
>>>>>>> acheter un magnifique hôtel et commencer les
>>>>>>> opérations. Nucingen se trouvait encore des ac-
>>>>>>> tions dans je ne sais quelles mines de plomb ar-
>>>>>>> gentifère, dans des mines de houille et dans deux
>>>>>>> canaux, actions bénéficiaires accordées pour la 
>>>>>>> mise en scène de ces quatre entreprises en pleine
>>>>>>> activité, supérieurement montées et en faveur, au
>>>>>>> moyen du dividende pris sur le capital. Nucin-
>>>>>>> gen pouvait compter sur un agio si les actions 
>>>>>>> montaient, mais le baron le négligea dans ses 
>>>>>>> calculs, il le laissait à fleur d’eau, sur la place, 
>>>>>>> afin d’attirer les poissons ! Il avait donc massé 
>>>>>>> ses valeurs, comme Napoléon massait ses trou-
>>>>>>> piers, afin de liquider durant la crise qui se des-
>>>>>>> sinait et qui révolutionna, en 26 et 27 les places 
>>>>>>> européennes. S’il avait eu son prince de Wagram, 
>>>>>>> il aurait pu dire comme Napoléon du haut du 
>>>>>>> Santon : « Examinez bien la place, tel jour, à telle 
>>>>>>> heure, il y aura là des fonds répandus ! » Mais à 
>>>>>>> qui pouvait-il se confier ? Du Tillet ne soupçonna 
>>>>>>>
>>>>>>>
>>>>>>>  
>>>>>>>   
>>>>>>> -- 
>>>>>>> -- 
>>>>>>> You received this message because you are subscribed to the Google
>>>>>>> Groups "tesseract-ocr" group.
>>>>>>> To post to this group, send email to [email protected] 
>>>>>>>
>>>>>>> To unsubscribe from this group, send email to
>>>>>>> [email protected] 
>>>>>>>
>>>>>>> For more options, visit this group at
>>>>>>> http://groups.google.com/group/tesseract-ocr?hl=en
>>>>>>>
>>>>>>> --- 
>>>>>>> You received this message because you are subscribed to the Google 
>>>>>>> Groups "tesseract-ocr" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>>> send an email to [email protected]. 
>>>>>>>
>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>

