You can use a regex to transform your reference text into one non-space character per line. Vertically select those characters and copy them to the clipboard. Then, in the box file, vertically select the column of symbols and replace it with the clipboard content.
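If you would rather script this than do it by hand in an editor, here is a minimal Python sketch of the same idea. It is only an illustration under a few assumptions: the function names `reference_chars` and `replace_box_symbols` are made up for this example, each box line is assumed to have the usual `symbol left bottom right top page` layout with one symbol per box, and the number of boxes must already equal the number of non-space characters in the reference text.

```python
import re


def reference_chars(text):
    """Return the non-whitespace characters of text, in reading order."""
    # Same effect as the regex transform: drop all whitespace so that each
    # remaining character can be written out on its own line.
    return list(re.sub(r"\s+", "", text))


def replace_box_symbols(box_lines, chars):
    """Replace the first field (the symbol) of each box line with a reference character."""
    boxes = [line for line in box_lines if line.strip()]
    if len(boxes) != len(chars):
        raise ValueError(
            f"{len(boxes)} boxes but {len(chars)} reference characters; "
            "split, merge or delete boxes first"
        )
    fixed = []
    for line, ch in zip(boxes, chars):
        # Split off the old symbol and keep the coordinates untouched.
        _old_symbol, coords = line.strip().split(None, 1)
        fixed.append(f"{ch} {coords}")
    return fixed


if __name__ == "__main__":
    # Hypothetical three-box example; the coordinates are invented.
    box = ["p 10 20 30 40 0", "c 32 20 50 40 0", "u 52 20 70 40 0"]
    print("\n".join(replace_box_symbols(box, reference_chars("pr é"))))
```

The length check is deliberate: when the counts differ, the boxes still need fixing by hand, and pasting the text anyway would shift every symbol after the first mismatch.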
That's the trick I normally use when editing box files. Good programming editors like jEdit or Notepad++ usually support vertical selection of text.

On Tuesday, March 11, 2014 2:03:55 AM UTC-5, Bernard Polarski wrote:
>
> I just mean to assert that the text is an absolute match of the image. You
> have to check every box file and, where needed, split, merge or delete some boxes.
> Once that is done, I still compare the result using this simple command:
>
> cat <file> | cut -c 1 | tr '\n' ' '
>
> Then I read every word again until I am satisfied that the box file is
> absolutely correct. I then store the image and the box file in a directory
> to be used when I want to create a traineddata. I am creating various
> directories for various types of font. But since version 3.03, for traineddata
> created from scanned images, I have less impact. It does have an effect, but I
> get more negative impacts for each good one. I am fighting hard to isolate one
> single effect. For the moment the best results are obtained by cleaning the
> FRA dictionary of seldom-used short words (2 letters). Now I feel the
> need to set up regression tests over 20 certified box/text pairs in order to
> measure the impact of one single change.
>
> Work is in progress and ABBY is already off, but I hope for more progress
> before submitting to my group.
>
> On Tuesday, March 11, 2014 00:08:34 UTC+1, Quan Nguyen wrote:
>>
>> Bernard,
>>
>> What do you mean by "assert a text box of 200 words"? Can you elaborate?
>> Thanks.
>>
>> Quan
>>
>> On Monday, March 10, 2014 11:06:18 AM UTC-5, Bernard Polarski wrote:
>>>
>>> Since I have the source, I will recompile it this evening at home and
>>> will let you know.
>>> It takes an average of 30 min to assert a text box of 200 words using
>>> jTessBoxEditor.
>>> This is a real issue.
>>>
>>> On Monday, March 10, 2014 13:31:39 UTC+1, zdenop wrote:
>>>
>>>> I have not run QBE on Windows for a long time.
>>>> Try this (QBE+depends)[1] - I run it on Win7 Pro 64-bit (even though the app and libs
>>>> are 32-bit, built with mingw 4.8, leptonica 1.70 and tesseract 3.03rc1).
>>>>
>>>> [1] http://www.sk-spell.sk.cx/tmp/qtb-1.11.1.ZIP
>>>>
>>>> Zdenko
>>>>
>>>>
>>>> On Mon, Mar 10, 2014 at 7:21 AM, Bernard Polarski <[email protected]> wrote:
>>>>
>>>>> I downloaded QBE and the additional libraries, but it does not start
>>>>> on my Windows 7. I just get the message that the application has stopped
>>>>> working and Windows has to close it.
>>>>>
>>>>>
>>>>> On Sunday, March 9, 2014 21:19:23 UTC+1, zdenop wrote:
>>>>>>
>>>>>> If I understood you correctly, you would like to have something
>>>>>> like this:
>>>>>>
>>>>>> tesseract lm-110.jpg lm-110 -l fra makebox
>>>>>>
>>>>>>
>>>>>> that creates a box file, and then some tool that will replace the
>>>>>> symbol (text) part of the box file with the content of e.g. lm-110.txt
>>>>>> (certified text)? I did this with QBE[1]. But there are some (QBE) limitations:
>>>>>>
>>>>>> - there must be one symbol per box
>>>>>> - the number of boxes must be the same as the count of symbols in your
>>>>>> text file (without spaces)
>>>>>>
>>>>>> So my workflow was something like this:
>>>>>>
>>>>>> 1. create the box file (or open the image in QBE - it will offer to
>>>>>> create the box file)
>>>>>> 2. remove unnecessary boxes (heading, footer, page numbers, scan
>>>>>> relics...)
>>>>>> 3. split multi-symbol boxes (i.e. boxes that contain more than one
>>>>>> symbol)
>>>>>> 4. import the text from an external file (QBE->File->Import...->Import
>>>>>> text file)
>>>>>>
>>>>>> It still needs user interaction (it is not automatic), but it can help if
>>>>>> you need something like that.
>>>>>>
>>>>>> [1] https://github.com/zdenop/qt-box-editor
>>>>>>
>>>>>> Zdenko
>>>>>>
>>>>>>
>>>>>> On Sat, Mar 8, 2014 at 7:47 PM, Bernard Polarski
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>> Let me summarize what I am doing and what I am trying to achieve.
>>>>>>>
>>>>>>> Tesseract is excellent when it comes to recognizing binary fonts
>>>>>>> (fonts that come from a computer, printed or directly generated from
>>>>>>> an application).
>>>>>>>
>>>>>>> The match is near perfect, and many times it is perfect.
>>>>>>> And it is now easy to train a text for a zillion fonts when it
>>>>>>> comes to binary fonts:
>>>>>>>
>>>>>>> text2image --text=$FIN --outputbase=$FOUT --fonts_dir=$FONT_DIR
>>>>>>> --render_per_font --find_fonts
>>>>>>>
>>>>>>> This will generate a zillion fonts. This is a big plus of
>>>>>>> version 3.03. But honestly this job has been done at Google.
>>>>>>>
>>>>>>> But trainings made from binary fonts are disappointing when they are
>>>>>>> applied to printed fonts, especially for books from the 19th century.
>>>>>>> I belong to a group that edits EPUBs of 19th-century books.
>>>>>>> Books of that kind come in collections, and the collections were
>>>>>>> often printed on the same machine.
>>>>>>>
>>>>>>> So instead of creating a library of 'Century old school' fonts, I am
>>>>>>> exploring the idea of creating a font dedicated to a publisher for a
>>>>>>> given period,
>>>>>>> e.g. 'EFlammarion1870.ttf', to be used on these books.
>>>>>>>
>>>>>>> I have plenty of scripts to automatically generate a
>>>>>>> traineddata file, starting from a directory containing img.tif files and
>>>>>>> their img.box files.
>>>>>>> But it is very time-consuming to generate every one of these box
>>>>>>> files.
>>>>>>>
>>>>>>> The idea is to start from a set of scanned images and grab a certified
>>>>>>> text
>>>>>>> from a site like Gutenberg (for French, ebooksgratuits.com provides
>>>>>>> more books).
>>>>>>> A string search for the first 3 words in the certified text, and there
>>>>>>> is the needed certified text.
>>>>>>>
>>>>>>> So I am now looking for a method to transform the certified
>>>>>>> text into a box file.
>>>>>>> Doing this for a few pages would let me quickly generate a new
>>>>>>> traineddata and test it.
>>>>>>> In this respect, it is clear that jTessBoxEditor is very
>>>>>>> good, but the process
>>>>>>> of generating the box file is too slow and error-prone.
>>>>>>>
>>>>>>>
>>>>>>> Here is a page extracted from "La maison Nucingen" whose print is
>>>>>>> quite bad, so it is interesting.
>>>>>>>
>>>>>>> http://gallica.bnf.fr/ark:/12148/bpt6k58135211/f107.image.r=la%20maison%20nucingen.langEN
>>>>>>>
>>>>>>> <https://lh4.googleusercontent.com/-7xPLX_2HR54/UxtWUEx8nBI/AAAAAAAAAB4/ro0vwKP0Oh4/s1600/lm-110.tif>
>>>>>>>
>>>>>>>
>>>>>>> The text:
>>>>>>> proposait d’opérer avec ses millions faits d’une
>>>>>>> main de papier rose à l’aide d’une pierre litho-
>>>>>>> graphique, de jolies petites actions à placer, pré-
>>>>>>> cieusement conservées dans son cabinet. Les ac-
>>>>>>> tions réelles allaient servir à fonder l’affaire,
>>>>>>> acheter un magnifique hôtel et commencer les
>>>>>>> opérations. Nucingen se trouvait encore des ac-
>>>>>>> tions dans je ne sais quelles mines de plomb ar-
>>>>>>> gentifère, dans des mines de houille et dans deux
>>>>>>> canaux, actions bénéficiaires accordées pour la
>>>>>>> mise en scène de ces quatre entreprises en pleine
>>>>>>> activité, supérieurement montées et en faveur, au
>>>>>>> moyen du dividende pris sur le capital. Nucin-
>>>>>>> gen pouvait compter sur un agio si les actions
>>>>>>> montaient, mais le baron le négligea dans ses
>>>>>>> calculs, il le laissait à fleur d’eau, sur la place,
>>>>>>> afin d’attirer les poissons ! Il avait donc massé
>>>>>>> ses valeurs, comme Napoléon massait ses trou-
>>>>>>> piers, afin de liquider durant la crise qui se des-
>>>>>>> sinait et qui révolutionna, en 26 et 27 les places
>>>>>>> européennes. S’il avait eu son prince de Wagram,
>>>>>>> il aurait pu dire comme Napoléon du haut du
>>>>>>> Santon : « Examinez bien la place, tel jour, à telle
>>>>>>> heure, il y aura là des fonds répandus ! » Mais à
>>>>>>> qui pouvait-il se confier ? Du Tillet ne soupçonna
>>>>>>>
>>>>>>> --
>>>>>>> --
>>>>>>> You received this message because you are subscribed to the Google
>>>>>>> Groups "tesseract-ocr" group.
>>>>>>> To post to this group, send email to [email protected]
>>>>>>> To unsubscribe from this group, send email to
>>>>>>> [email protected]
>>>>>>> For more options, visit this group at
>>>>>>> http://groups.google.com/group/tesseract-ocr?hl=en
>>>>>>>
>>>>>>> ---
>>>>>>> You received this message because you are subscribed to the Google
>>>>>>> Groups "tesseract-ocr" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>> send an email to [email protected].
>>>>>>> For more options, visit https://groups.google.com/d/optout.

