[tesseract-ocr] Re: Using latex to train tesseract

lauhlau Tue, 18 Nov 2014 12:25:22 -0800

Hi,

I am trying to do what you did. I noticed that this topic is reeeeeaaaally 
old (7 years !).


But I could not download your script files script1.pl and correlatebox.pl
(404 Not Found error).

Do you still have them anywhere ?

Thanks in advance

On Sunday, 28 October 2007 11:29:10 UTC+1, begemotv2718 wrote:
>
> http://tesseract-ocr.googlegroups.com/web/latex_train_kannada.tgz
>
> Some instructions.
> To experiment with all this on a Unix system you need to have some
> packages installed.
> First of all, you need the itrans package, which lets you typeset the
> Kannada language from a transliterated input file.
> If you type
> your_machine>sudo apt-get install itrans itrans-fonts itrans-doc
> this will install the itrans package and some LaTeX packages, since
> itrans depends on them.
>
> Secondly, you need the Font::TFM Perl package. Unfortunately it is not
> in the standard distribution, so you need to run
> your_machine>sudo cpan install Font::TFM
> cpan will ask you several questions, for which you may give the default
> answers, and it will finally install the Font::TFM package.
> Before running cpan, you may want to install several dependencies of
> the Font::TFM package via apt-get
> your_machine>sudo apt-get install libparse-yapp-perl libio-pty-perl libdate-manip-perl libxml-dom-xpath-perl
> I am sorry for this inconvenience. Currently I am trying to rewrite
> this part of my code in C; this will be easier then.
>
> You may also need the texlive-extra-utils package:
> sudo apt-get install texlive-extra-utils
> This will provide the dvitype program.
>
> How to use all this
> First of all, prepare a *.itx file with transliterated text. I included
> two sample files for you: one contains something like an alphabet and
> the other a sample of poetry that I found in the itrans documentation.
> The transliteration scheme in these files should be similar to the one
> that your Baraha software uses.
>
> Then you process your file with itrans
>
> itrans <sample.itx >sample.tex
>
> Then you process it with latex to get the dvi file
>
> latex sample.tex
>
> After that you will have a sample.dvi file. You need to open it with
> xdvi sample.dvi
> so that the font files get generated.
>
> Then you obtain the texbox file, which contains the bounding boxes of
> all characters as LaTeX typeset them.
> dvitype sample.dvi | perl scripts/kannada.pl > sample.texbox
>
> Then you produce a tif file for tesseract
>
> dvips -o sample.ps sample.dvi
> gs -r300x300 -sOutputFile=sample.tif -sDEVICE=tiffg4 -dNOPAUSE sample.ps quit.ps
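
A note on units, as a minimal sketch rather than the contents of kannada.pl (which are not shown in this thread). Positions printed by dvitype are in DVI units, which for TeX output are scaled points (65536 sp per TeX point, 72.27 points per inch). By DVI convention the origin sits 1 inch from the top and left page edges, and the vertical coordinate grows downward. Tesseract box files use pixel coordinates with the origin at the bottom-left corner of the image. The Python below shows the arithmetic for the 300 dpi image. The function names, the glyph width/height/depth arguments (presumably the kind of data Font::TFM supplies) and the page height in pixels are assumptions made for illustration.

```python
SP_PER_PT = 65536      # scaled points per TeX point
PT_PER_INCH = 72.27    # TeX points per inch


def sp_to_px(sp, dpi=300):
    """Convert a length in scaled points to pixels at the given resolution."""
    return sp / SP_PER_PT / PT_PER_INCH * dpi


def glyph_box_px(h_sp, v_sp, width_sp, height_sp, depth_sp,
                 image_height_px, dpi=300, origin_offset_in=1.0):
    """Return (left, bottom, right, top) in box-file pixel coordinates.

    (h_sp, v_sp) is the glyph reference point (left end of the baseline)
    in DVI coordinates; width/height/depth are the glyph metrics.
    All parameter names here are hypothetical.
    """
    offset = origin_offset_in * dpi                # DVI origin, in pixels
    left = offset + sp_to_px(h_sp, dpi)
    right = left + sp_to_px(width_sp, dpi)
    baseline = offset + sp_to_px(v_sp, dpi)        # measured from the top
    top_from_top = baseline - sp_to_px(height_sp, dpi)
    bottom_from_top = baseline + sp_to_px(depth_sp, dpi)
    # Flip the vertical axis: box files count upward from the bottom edge.
    return (round(left), round(image_height_px - bottom_from_top),
            round(right), round(image_height_px - top_from_top))
```

The metric box of a glyph is not necessarily the extent of its ink, so boxes computed this way may differ from the boxes tesseract finds by a few pixels.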
>
> Now you run tesseract to produce its own box files
> tesseract sample.tif sample batch.nochop makebox
>
> Then you run my second program to get the final file for training tesseract:
> perl scripts/correlatebox.pl sample.texbox sample.txt >sample.box
>
> If you have ImageMagick installed on your system, you may want to run
> another script
> perl scripts/draw2.pl sample.tif sample_dir sample.box >sample.html
> which will produce an HTML file that you can open in your browser.
> This file will contain pictures of the characters cut out of the tif
> file, together with the text representation of these characters.
>
> Well, the final training steps are the usual ones
> tesseract sample.tif junk nobatch box.train
> mftrain sample.tr
> cntrain sample.tr
> etc.
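
The steps above, collected in one place as a sketch. The command lines are the ones given in this message. The Python wrapper, its function names and the dry-run switch are assumptions added for illustration, and the training steps hidden behind "etc." are not filled in. Note that xdvi opens a window, so that step is interactive.

```python
import shlex
import subprocess


def training_commands(stem, scripts_dir="scripts"):
    """Return the command lines from the message, in order, for one sample."""
    s = shlex.quote(stem)
    d = shlex.quote(scripts_dir)
    return [
        f"itrans <{s}.itx >{s}.tex",
        f"latex {s}.tex",
        f"xdvi {s}.dvi",                                  # triggers font generation
        f"dvitype {s}.dvi | perl {d}/kannada.pl >{s}.texbox",
        f"dvips -o {s}.ps {s}.dvi",
        f"gs -r300x300 -sOutputFile={s}.tif -sDEVICE=tiffg4 -dNOPAUSE {s}.ps quit.ps",
        f"tesseract {s}.tif {s} batch.nochop makebox",    # writes {stem}.txt
        f"perl {d}/correlatebox.pl {s}.texbox {s}.txt >{s}.box",
        f"tesseract {s}.tif junk nobatch box.train",
        f"mftrain {s}.tr",
        f"cntrain {s}.tr",
    ]


def run_all(stem, dry_run=True):
    """Print each command; execute it as well unless dry_run is set."""
    for cmd in training_commands(stem):
        print(cmd)
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, shell=True, check=True)
```

Calling run_all("sample") only prints the commands; run_all("sample", dry_run=False) would execute them and needs all the tools named above to be installed.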
>
> On Oct 28, 4:39 am, begemotv2718 <[email protected]> wrote:
> > Well, actually I run not Ubuntu but Debian Linux (which is like a
> > parent of Ubuntu), as well as Mac OS on my laptop. Most things do
> > not depend much on which Unix-like system you use. By the way, you
> > may try to install Cygwin on your Windows box; this is the easiest
> > way to turn a Windows machine into an (almost) fully functional Unix
> > without sacrificing anything from Windows itself.
> >
> > I experimented a little bit with the Kannada language on the samples
> > from the itrans package. The machinery seems to work OK. However, to
> > work on this further I need a person who knows the language.
> >
> > I am going to upload the archive with some samples now.
> >
> > On Oct 25, 4:27 am, "74yrs old" <[email protected]> wrote:
> >
> > > Thanks for the detailed procedure. It appears you are using Ubuntu
> > > Linux. If so, could you forward a copy of the typescript generated
> > > by Ubuntu, so that I can study it and try it on a LiveCD to get
> > > hands-on experience?
> >
> > > On 10/25/07, begemotv2718 <[email protected]> wrote:
> >
> > > > Well, the basic procedure for your case can be the following.
> > > > a) You install MiKTeX and some package for it that allows LaTeX to
> > > > understand the Kannada language. As far as I know, there is a
> > > > package called itrans that works for this. However, you need to
> > > > provide input for it using Latin transliteration.
> >
> > > > b) You prepare the training text (using transliteration) and process
> > > > it with itrans and latex. At this stage you get a .dvi file that is
> > > > typeset in the Kannada language and contains (in a cryptic form) all
> > > > the information about the character boxes of your text. You extract
> > > > this information using my Perl script
> > > > dvitype <file.dvi> | perl script1.pl > file.texbox
> >
> > > > c) You produce the training image for tesseract. You can do this
> > > > electronically using dvips + ghostscript
> > > > dvips -o <ps.file> <dvi.file>
> > > > gs -r300x300 -dNOPAUSE -dBATCH -sOutputFile=<file.tif> -sDEVICE=tiffg4 <ps.file>
> > > > or by printing the dvi file on a printer and then scanning it.
> > > > Then you run
> > > > tesseract <file.tif> <file> batch.nochop makebox
> > > > which writes file.txt, and you rename file.txt to file.box.
> >
> > > > d) You produce the final box file for training tesseract using my
> > > > second Perl script
> > > > perl correlatebox.pl file.texbox file.box > result_file.box
> > > > My script automatically finds the correspondence between the boxes
> > > > in the texbox file (produced from the dvi) and the boxes recognized
> > > > by tesseract. It then substitutes the correct character codes into
> > > > the tesseract boxes to produce the final box file for training. It
> > > > detects possible problems, like a character being split into two
> > > > parts or two characters being merged into one, and handles them. At
> > > > this stage no human intervention is necessary.
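
Since correlatebox.pl is not shown in this thread, the following is only a guess at the kind of matching described, not the author's algorithm. It is a minimal Python sketch that pairs boxes by overlap area, merges tesseract boxes that fall on the same TeX glyph (a split character), and concatenates the labels of TeX glyphs that fall inside one tesseract box (merged characters). The Box tuple, the function names and the "largest overlap" rule are all assumptions.

```python
from collections import namedtuple

Box = namedtuple("Box", "label left bottom right top")


def overlap_area(a, b):
    """Area of the intersection of two boxes (0 if they do not overlap)."""
    w = min(a.right, b.right) - max(a.left, b.left)
    h = min(a.top, b.top) - max(a.bottom, b.bottom)
    return w * h if w > 0 and h > 0 else 0


def best_match(box, candidates):
    """Index of the candidate overlapping `box` most, or None if none does."""
    areas = [overlap_area(box, c) for c in candidates]
    if not areas or max(areas) == 0:
        return None
    return areas.index(max(areas))


def correlate(tex_boxes, ocr_boxes):
    """Relabel OCR boxes with TeX labels, grouping split and merged glyphs."""
    n = len(tex_boxes)
    parent = list(range(n + len(ocr_boxes)))  # union-find over both lists

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Link each OCR box to its best TeX box (catches split characters) ...
    for j, ocr in enumerate(ocr_boxes):
        i = best_match(ocr, tex_boxes)
        if i is not None:
            parent[find(n + j)] = find(i)
    # ... and each TeX box to its best OCR box (catches merged characters).
    for i, tex in enumerate(tex_boxes):
        j = best_match(tex, ocr_boxes)
        if j is not None:
            parent[find(i)] = find(n + j)

    groups = {}
    for k in range(len(parent)):
        tex_ids, ocr_ids = groups.setdefault(find(k), ([], []))
        (tex_ids if k < n else ocr_ids).append(k if k < n else k - n)

    # Keep only groups with boxes from both sides, in TeX (reading) order.
    matched = [g for g in groups.values() if g[0] and g[1]]
    matched.sort(key=lambda g: g[0][0])
    result = []
    for tex_ids, ocr_ids in matched:
        parts = [ocr_boxes[j] for j in ocr_ids]
        result.append(Box(
            "".join(tex_boxes[i].label for i in tex_ids),
            min(p.left for p in parts), min(p.bottom for p in parts),
            max(p.right for p in parts), max(p.top for p in parts)))
    return result
```

Boxes with no counterpart on the other side are silently dropped here; a real script would probably want to report them.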
> >
> > > > e) You run tesseract in training mode and follow all the
> > > > remaining steps (running mftrain, cntrain, etc.; none of this
> > > > requires human intervention either, and it can be done with a
> > > > shell script).
> >
> > > > f) You get the result and test it.
> >
> > > > All the operations except the initial text file preparation (and
> > > > scanning, if you choose this option) can be automated, so that you
> > > > just run a shell script named something like latex-train.sh with a
> > > > single text file as input and get as output all the tesseract files
> > > > (normproto, pffmtable, etc.) that result from training.
> >
> > > > However, all this relies quite heavily on a Unix operating system
> > > > environment: you need a working installation of latex, perl, and
> > > > ghostscript, and you need to be able to run all of them from the
> > > > command line. I am not sure that doing all this is easy for a
> > > > Windows user: although the possibility exists on that system, it
> > > > does not fit the general Windows philosophy of having a black-box
> > > > style GUI program that does everything.
> >
> > > > On Oct 24, 4:32 am, "74yrs old" <[email protected]> wrote:
> > > > > At present I have the Baraha software (www.baraha.com). With the
> > > > > help of its BarahaIME, I have to edit the box file generated by
> > > > > tesseract (i.e. by typing in Kannada script) in Windows.
> >
> > > > > I presume your point is that the characters in the text file
> > > > > generated by running "tesseract fontfile.tif fontfile batch.nochop
> > > > > makebox" do not match the characters in the original image in the
> > > > > TIFF file. Your suggestion to automate the process with the help of
> > > > > LaTeX is not clear to me. Do you mean that, with the help of LaTeX,
> > > > > the font image in the TIFF file can be copied into the generated
> > > > > (i.e. output) text file, in addition to the characters printed by
> > > > > default by tesseract? If so, it is a good idea.
> > > > > I would like to see output samples generated by you using the Perl
> > > > > script and LaTeX.
> > > > > -Sriranga(74yrsold)
> >
> > > > > On 10/24/07, begemotv2718 <[email protected]> wrote:
> >
> > > > > > For 74yrs old: by googling, I found a project supporting the
> > > > > > Kannada language for LaTeX.
> > > > > > http://ptsg.eecs.berkeley.edu/%7Evenkates/kannada.html
> >
> > > > > > Whether this project is user friendly enough I do not know.
> >
> > > > > > The problem is that one needs to write an encoding translation
> > > > > > procedure to translate symbols from the TeX internal font
> > > > > > encoding into Unicode. That may require some programming skill
> > > > > > and a moderate amount of research.
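
As a rough illustration of what such a translation procedure amounts to, here is a minimal Python sketch of a lookup from (font name, character slot) pairs, as they would appear in a DVI file, to Unicode strings. The font name "examplefont" and the slot numbers are invented for the example and do not describe any real Kannada TeX font; only the Unicode code points are real.

```python
import unicodedata

# Hypothetical table: (TeX font name, slot in that font) -> Unicode text.
GLYPH_TO_UNICODE = {
    ("examplefont", 65): "\u0c85",        # KANNADA LETTER A
    ("examplefont", 75): "\u0c95",        # KANNADA LETTER KA
    ("examplefont", 97): "\u0c95\u0cbe",  # one glyph, two code points (KA + AA)
}


def to_unicode(font, slot, unknown="\ufffd"):
    """Translate one glyph; unknown glyphs become U+FFFD so they stand out."""
    return GLYPH_TO_UNICODE.get((font, slot), unknown)


def describe(text):
    """List the Unicode character names in a translated string."""
    return [unicodedata.name(ch) for ch in text]
```

The research effort mentioned above lies in filling such a table correctly for every glyph of the actual font, including glyphs that stand for several code points at once.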
> >
> > > > > > On Oct 24, 2:20 am, "74yrs old" <[email protected]> wrote:
> > > > > > > I want to automate training as suggested by begemotv2718. It
> > > > > > > would be appreciated if a suitable program is available. I am
> > > > > > > not a programmer.
> >
> > > > > > > On 10/24/07, Jeffrey Ratcliffe <[email protected]> wrote:
> >
> > > > > > > > On 24/10/2007, 74yrs old <[email protected]> wrote:
> > > > > > > > > Will LaTeX work on MS Windows and support Indian languages
> > > > > > > > > like Kannada? If so, how do I use it?
> >
> > > > > > > > The MiKTeX project is an excellent LaTeX for M$ Windows.
> > > > > > > > Whether it supports Kannada, I don't know.
>
>
