Hi, I am trying to do what you did. I noticed that this topic is really old (7 years!).
But I could not download your script files script1.pl and correlatebox.pl (404 Not Found error).
Do you still have them anywhere?

Thanks in advance.

On Sunday, October 28, 2007 at 11:29:10 UTC+1, begemotv2718 wrote:
>
> http://tesseract-ocr.googlegroups.com/web/latex_train_kannada.tgz
>
> Some instructions.
> To experiment with all this on a Unix system you need to have some
> packages installed.
> First of all, you need the itrans package, which lets you typeset the
> Kannada language from a transliterated input file.
> If you type
>     your_machine>sudo apt-get install itrans itrans-fonts itrans-doc
> this will install the itrans package and some latex packages, since itrans
> depends on them.
>
> Secondly, you need the Font::TFM perl package. Unfortunately it is not in
> the standard distribution, so you need to run
>     your_machine>sudo cpan install Font::TFM
> cpan will ask you several questions, for which you may give the default
> answers, and it will finally install the Font::TFM package.
> You may want to install several dependencies of the Font::TFM package
> via apt-get
>     your_machine>sudo apt-get install libparse-yapp-perl libio-pty-perl libdate-manip-perl libxml-dom-xpath-perl
> before running cpan.
> I am sorry for this inconvenience. Currently I am trying to rewrite
> this part of my code in C; it will be easier then.
>
> You may also need the texlive-extra-utils package:
>     sudo apt-get install texlive-extra-utils
> This will provide the dvitype program.
>
> How to use all this
> First of all, prepare a *.itx file with transliterated text. I included
> two sample files for you: one contains something like an alphabet and the
> other a sample of poetry that I found in the itrans documentation. The
> transliteration scheme in these files should be similar to the one that
> your Baraha software uses.
>
> Then you process your file with itrans:
>
>     itrans <sample.itx >sample.tex
>
> Then you process it with latex to get the dvi file:
>
>     latex sample.tex
>
> After that you will have the sample.dvi file. You need to open it with
>     xdvi sample.dvi
> in order to have the font files generated.
>
> Then you obtain the texbox file, which contains the bounding boxes for
> all characters as latex typeset them:
>     dvitype sample.dvi | perl scripts/kannada.pl > sample.texbox
>
> Then you produce the tif file for tesseract:
>
>     dvips -o sample.ps sample.dvi
>     gs -r300x300 -sOutputFile=sample.tif -sDEVICE=tiffg4 -dNOPAUSE sample.ps quit.ps
>
> Now you run tesseract to produce its own box file:
>     tesseract sample.tif sample batch.nochop makebox
>
> Then you run my second program to get the final file to train tesseract:
>     perl scripts/correlatebox.pl sample.texbox sample.txt >sample.box
>
> If you have imagemagick installed on your system, you may want to run
> another script
>     perl scripts/draw2.pl sample.tif sample_dir sample.box >sample.html
> which will produce an html file that you can open in your browser.
> This file will contain pictures of the characters cut out from the tif
> file, as well as a representation of these characters.
>
> Well, and the final training steps are the usual ones:
>     tesseract sample.tif junk nobatch box.train
>     mftrain sample.tr
>     cntrain sample.tr
>     etc.
>
> On Oct 28, 4:39 am, begemotv2718 wrote:
> > Well, actually I run not Ubuntu but Debian Linux (which is like a
> > parent of Ubuntu), as well as Mac OS on the laptop computer. Most
> > things do not depend much on which Unix-like system you use. By the
> > way, you may try to install Cygwin on your Windows box; this is the
> > easiest way to turn a Windows machine into an (almost) fully functional
> > Unix without sacrificing anything from Windows itself.
> >
> > I experimented a little bit with the Kannada language on the samples from
> > the itrans package. The machinery seems to work OK.
> > However, to work further on this I need a person who knows the language.
> >
> > I am going to download the archive with some samples now.
> >
> > On Oct 25, 4:27 am, "74yrs old" wrote:
> >
> > > Thanks for the detailed procedure. It appears you are using Ubuntu
> > > Linux. If so, is it possible to forward a copy of the typescript
> > > generated by Ubuntu, to enable me to study it and try it on a LiveCD
> > > to get hands-on experience?
> > >
> > > On 10/25/07, begemotv2718 wrote:
> > > >
> > > > Well, the basic procedure for your case can be the following.
> > > > a) You install MiKTeX and some package for it that allows latex to
> > > > understand the Kannada language; as far as I know there is a
> > > > package called itrans that works with this. However, you need to
> > > > provide input for it using Latin transliteration.
> > > >
> > > > b) You prepare the training text (using transliteration) and process it
> > > > with itrans and latex. At this stage you get a .dvi file that is
> > > > typeset in the Kannada language and contains (in a cryptic form) all
> > > > the information about the character boxes of your text. You extract
> > > > this information using my perl script:
> > > >     dvitype <file.dvi> | perl script1.pl > file.texbox
> > > >
> > > > c) You produce the training image for tesseract. You can do this
> > > > electronically using dvips + ghostscript
> > > >     dvips -o <ps.file> <dvi.file>
> > > >     gs -r300x300 -dNOPAUSE -sOutputFile=<file.tif> -sDEVICE=tiffg4 <ps.file>
> > > > or by printing the dvi file on a printer and then scanning it.
> > > > You run
> > > >     tesseract <file.tif> <file.txt> batch.nochop makebox
> > > > You rename file.txt to file.box.
> > > >
> > > > d) You produce the final box file for training tesseract using my
> > > > second perl script:
> > > >     perl correlatebox.pl file.texbox file.box > result_file.box
> > > > My script automatically finds the correlation between the boxes in the
> > > > texbox file (produced from the dvi) and the boxes recognized by
> > > > tesseract. It then substitutes the correct character codes into the
> > > > tesseract boxes to produce the final box file for training. It detects
> > > > possible problems, such as a character split into two parts or two
> > > > characters merged into one, and handles them. At this stage no human
> > > > intervention is necessary.
> > > >
> > > > e) You run tesseract in training mode and follow all the
> > > > remaining steps (running mftrain, cntrain, etc.; all this also does
> > > > not require human intervention and can be done with a shell script).
> > > >
> > > > f) You get the result and test it.
> > > >
> > > > All the operations except the initial text file preparation (and
> > > > scanning, if you choose this option) can be automated, so that you
> > > > just run a shell script named something like latex-train.sh with a
> > > > single text file as input and get as output all the tesseract files
> > > > (normproto, pffmtable, etc.) which are the results of training.
> > > >
> > > > However, all this relies quite heavily on a Unix operating system
> > > > environment: you need a working installation of latex, perl, and
> > > > ghostscript, and you need to be able to run all this from the command
> > > > line. I am not quite sure that doing all this is easy for a Windows
> > > > user: although such possibilities exist on that system, they do not
> > > > fit the general Windows philosophy of having a black-box-style GUI
> > > > program that does everything.
> > > >
> > > > On Oct 24, 4:32 am, "74yrs old" wrote:
> > > > > At present I have the Baraha software (www.baraha.com).
> > > > > With the help of this BarahaIME, I have to edit the textbox
> > > > > generated by tesseract (i.e. by typing in Kannada script) in Windows.
> > > > >
> > > > > I presume your point is that the characters (font) in the text file
> > > > > generated by running "tesseract fontfile.tif fontfile batch.nochop
> > > > > makebox" do not resemble (are not identical to) the original image
> > > > > characters (font) in the tiff file. Your suggestion to automate the
> > > > > process with the help of LaTeX is not clear. Do you mean that, with
> > > > > the help of LaTeX software, the font image in the tiff file can be
> > > > > copied into the generated (i.e. output) text file, in addition to the
> > > > > characters printed by default by tesseract? If so, it is a good idea.
> > > > > I would like to see output samples generated by you using the perl
> > > > > script and latex.
> > > > > -Sriranga(74yrsold)
> > > > >
> > > > > On 10/24/07, begemotv2718 wrote:
> > > > > >
> > > > > > For 74yrs old. I found a project supporting the Kannada language
> > > > > > for LaTeX by googling:
> > > > > > http://ptsg.eecs.berkeley.edu/%7Evenkates/kannada.html
> > > > > >
> > > > > > Whether this project is user-friendly enough I do not know.
> > > > > >
> > > > > > The problem is that one needs to write an encoding translation
> > > > > > procedure to be able to translate symbols from the TeX internal
> > > > > > font encoding into Unicode. That may require some programming skill
> > > > > > and a moderate amount of research.
> > > > > >
> > > > > > On Oct 24, 2:20 am, "74yrs old" wrote:
> > > > > > > I want to automate training as suggested by begemotv2718. It
> > > > > > > would be appreciated if a suitable program is available. I am
> > > > > > > not a programmer.
> > > > > > >
> > > > > > > On 10/24/07, Jeffrey Ratcliffe wrote:
> > > > > > > >
> > > > > > > > On 24/10/2007, 74yrs old wrote:
> > > > > > > > > Will LaTeX work in MS Windows and support Indian languages
> > > > > > > > > like Kannada? If so, how do I use it?
> > > > > > > >
> > > > > > > > The MiKTeX project is an excellent LaTeX for M$ Windows.
> > > > > > > > Whether it supports Kannada, I don't know.
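
For anyone retracing the quoted steps, they can be strung together roughly as follows. This is only a sketch of the latex-train.sh idea mentioned in the thread, not the original author's script. The names step, latex_train and RUN are made up for illustration. The scripts/ paths assume the layout described for latex_train_kannada.tgz, so kannada.pl and correlatebox.pl still have to be obtained from somewhere. The interactive xdvi font-generation step is left out. By default the script only prints the commands; nothing is executed unless RUN=1 is set.

```shell
#!/bin/sh
# Hypothetical wrapper around the pipeline quoted in this thread.
# Assumption: scripts/kannada.pl and scripts/correlatebox.pl exist locally.

step() {
    # Print the command line; execute it only when RUN=1.
    echo "+ $*"
    if [ "${RUN:-0}" = 1 ]; then
        sh -c "$*" || return 1
    fi
}

latex_train() {
    base=$1   # e.g. "sample" for sample.itx
    # Transliterated text -> TeX -> dvi
    step "itrans <$base.itx >$base.tex" &&
    step "latex $base.tex" &&
    # Character boxes as latex typeset them
    step "dvitype $base.dvi | perl scripts/kannada.pl >$base.texbox" &&
    # dvi -> PostScript -> 300 dpi TIFF for tesseract
    step "dvips -o $base.ps $base.dvi" &&
    step "gs -r300x300 -sOutputFile=$base.tif -sDEVICE=tiffg4 -dNOPAUSE $base.ps quit.ps" &&
    # Boxes as tesseract sees them (written to $base.txt)
    step "tesseract $base.tif $base batch.nochop makebox" &&
    # Merge the two box sets into the final training box file
    step "perl scripts/correlatebox.pl $base.texbox $base.txt >$base.box" &&
    # Usual training steps
    step "tesseract $base.tif junk nobatch box.train" &&
    step "mftrain $base.tr" &&
    step "cntrain $base.tr"
}

# Run only when a base name is given on the command line.
if [ $# -gt 0 ]; then
    latex_train "$1"
fi
```

Saved as latex-train.sh, "sh latex-train.sh sample" lists the commands for sample.itx, and "RUN=1 sh latex-train.sh sample" would actually run them, stopping at the first failure.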

