Re: Need help in recognizing english texts with sanskrit roman diacritical marks.

V S Rawat Thu, 28 Nov 2013 01:24:18 -0800

Vim does not actually run sed; its command line (ex mode) has its own substitute command, :s, which uses the same s/old/new/g syntax as sed.

Type : in vim (when not in insert mode).

The cursor will move to the bottom line, where you can type a substitute command and press Enter; it then runs on the text. For example, :%s/Å/Ā/g replaces every Å in the file. You can also give a line range for it to operate on (e.g. :10,20s/Å/Ā/g), and add the c flag (:%s/Å/Ā/gc) to confirm each replacement, a feature that I guess might not be possible with sed's script-wise replacement.


I hope I am saying it correctly.

Thanks.
--
Rawat

On 11/28/2013 9:31 AM, Jaanus Henno wrote:

How do you run sed on Vim?


On Thu, Nov 28, 2013 at 12:53 AM, V S Rawat wrote:

    Yes, for Srivas ji's file text is 100% text, not images, and is 100%
    extractable to word/text file by simple copy paste. ocr is just not
    needed.

    Then, it is good that sed will make the changes without need of ocr.
    Good thought.

    I use vim on w8 so, I wouldn't downgrade to sed. he he. just
    kidding. vim has sed built in. :-)

    Thanks.
    --
    Rawat

    On 11/27/2013 9:50 PM, Shree Devi Kumar wrote:

        Rawatji,

        I was going by the assumption that the text can be easily extracted
        from his pdf by saving as txt. In that case just running the sed
        script will fix the text for the letters with diacritics which were
        mapped to some other letters in the ascii font.

        Doing OCR never gives 100% correct result, so to use the OCR output
        and postprocess in this case may not be the best solution.

        You could try windows version of sed from
        http://gnuwin32.sourceforge.net/packages/sed.htm

        I only tested using one para of text from page 11.

        Shree

        Shree Devi Kumar
        ______________________________________________________________
        भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


        On Wed, Nov 27, 2013 at 9:29 PM, V S Rawat wrote:

             That is a very convenient solution, Shree Devi ji.

             However, if sed or other "substitutors" are not available, or if
             one wants to avoid using them, I think it can be done using the
             built-in post-processing method of tesseract.

             Use san.DangAmbigs.txt or hin.DangAmbigs.txt, whichever language
             you are using.

             Then put the substitutions in it as
             Å=Ā
             one per line.

             Would it work equally well and automatically, without needing a
             manual step?

             If so, then, Shree Devi ji, is there any major benefit of
             post-processing in sed?

             Please remind me where this DangAmbigs file is to be put.

             Thanks.
             --
             Rawat


             On 11/27/2013 6:50 PM, Shree Devi Kumar wrote:

                 I think rather than try to OCR, please extract the text and
                 then run a conversion script to change the letters with
                 diacritical marks.

                 eg. you would do the following substitution using sed for
                 the sample text from page 11

                 s/Å/Ā/g
                 s/å/ā/g
                 s/®/ṛ/g
                 s/ß/ṣ/g
                 s/∫/ṇ/g
                 s/î/ī/g
                 s/Ê/Ī/g
                 s/¸/Ś/g
                 s/Ω/ś/g
                 s/ü/ū/g

                 Also attaching sed script as a utf-8 text file.
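
The substitutions above can also be sketched in Python, for readers who do not have sed installed (for example on Windows). This is only a minimal sketch: the mapping is copied from the sed lines in this message, while the function name fix_diacritics and the sample string are made up for illustration. Like the sed script, it applies each replacement in turn over the whole text.

```python
# Mapping copied from the sed substitutions in this message:
# character as it appears in the copied text -> intended letter.
LEGACY_TO_UNICODE = {
    "Å": "Ā",
    "å": "ā",
    "®": "ṛ",
    "ß": "ṣ",
    "∫": "ṇ",
    "î": "ī",
    "Ê": "Ī",
    "¸": "Ś",
    "Ω": "ś",
    "ü": "ū",
}


def fix_diacritics(text: str) -> str:
    """Apply each replacement globally, one after another (like s/x/y/g)."""
    for wrong, right in LEGACY_TO_UNICODE.items():
        text = text.replace(wrong, right)
    return text


if __name__ == "__main__":
    # Hypothetical sample; real input would be the text copied from the PDF.
    print(fix_diacritics("K®ß∫a"))  # prints: Kṛṣṇa
```

To process a real file, one would read it with open(path, encoding="utf-8"), pass the contents to fix_diacritics, and write the result out again. The mapping only covers the characters seen in the page 11 sample, so other pages may need more entries.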

                 Shree Devi Kumar

        __________________________________________________________________

                 भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


                 On Wed, Nov 27, 2013 at 3:45 PM, V S Rawat wrote:

                      Those Ā á characters are defined in the Garamond font,
                      but the character codes used in this document are not
                      the same as those defined in Garamond.

                      So it must be some other font in which these codes have
                      been assigned to these characters.

                      The document lists a dozen fonts; one of them might be
                      it. You need to figure out which font it could be by
                      trial and error.
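
One way to make the trial-and-error step less blind is to list which non-ASCII characters actually occur in the copied text, since those are the candidates for remapping. The following is a rough Python sketch under that assumption; the function name non_ascii_inventory and the sample string are hypothetical, not something taken from the document or its fonts.

```python
import unicodedata
from collections import Counter


def non_ascii_inventory(text: str) -> Counter:
    """Count each character outside the 7-bit ASCII range."""
    return Counter(ch for ch in text if ord(ch) > 127)


if __name__ == "__main__":
    # Hypothetical sample; real input would be the text copied from the PDF.
    sample = "K®ß∫a and Åcårya"
    for ch, count in non_ascii_inventory(sample).most_common():
        # unicodedata.name falls back to "UNKNOWN" for unnamed code points.
        name = unicodedata.name(ch, "UNKNOWN")
        print(f"U+{ord(ch):04X} {ch} {name} x{count}")
```

Each printed line gives the code point, the character, and its Unicode name; comparing that list against the printed page shows which letter each code is standing in for.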

                      Thanks.
                      --
                      Rawat


                      On 11/27/2013 3:17 PM, Jaanus Henno wrote:

                          Ok, you can try page 11. There is a glossary and
                          lots of words with diacritics. Thanks.


                          On Wed, Nov 27, 2013 at 4:41 PM, V S Rawat wrote:


                               "words with sanskrit transliteration marks are used"

                               Could you please point out the exact pages where
                               to look for it? I will try to ocr it and see the
                               results.

                               Also,
                               http://www.omkarananda-ashram.org/Sanskrit/itranslator99.htm#downloads

                               The above page and several links from that page
                               also have a lot of Sanskrit fonts. Maybe some of
                               them might be of use to you.

                               Thanks.
                               --
                               Rawat


                               On 11/27/2013 9:16 AM, Srivas wrote:

                                   Hi Rawat!

                                   I'm really sorry, I didn't know that this is a
                                   mailing list type of forum ;-(

                                   Second, if you look carefully, you will see that
                                   the text is not entirely English. In many places
                                   words with sanskrit transliteration marks are
                                   used. But as you said, it can actually be
                                   copy/pasted, and it didn't even come to my mind!
                                   So this part is actually working and that is
                                   great! So I am almost there. The remaining
                                   problem is of another type. The provided tamalten
                                   font will display the marks, but I need to use
                                   another font to display the final document. It
                                   also contains the same diacritical marks but uses
                                   another encoding. But this might be a question
                                   for another person; I know the author of the
                                   fonts, I will ask him. Thanks for the help!

                                   Btw. If anyone needs to use sanskrit
                                   transliterated fonts, here are the resources:
                                   http://www.krishna-das.com/ksyberspace/fonts/

                                   On Tuesday, November 26, 2013 4:47:11 PM UTC+7, V S Rawat wrote:

                                        Dear Sir Srivas ji,

                                        Firstly, you should not have sent a 2.2 MB,
                                        68-page pdf file and a 181 KB zip to all the
                                        list members unasked. You could have uploaded
                                        it somewhere and sent the link, so that only
                                        those who can contribute would download it.
                                        It is a waste of time and bandwidth to get
                                        such huge messages.

                                        Secondly, I couldn't really understand your
                                        issue. I saw your pdf file. It is pure
                                        English. You can open it in any pdf reader
                                        and just copy the entire text from there and
                                        paste it into a text or word file. So, what
                                        else exactly are you looking for? Please
                                        elaborate.

                                        You don't even need to ocr it. This is
                                        already ASCII text.

                                        Thanks.
                                        --
                                        Rawat


                                        On 11/26/2013 12:40 PM, Srivas wrote:
                                         > Hi!
                                         > I have a bunch of PDF journal files and I need to get
                                         > the text out of them. They contain a lot of romanized
                                         > sanskrit diacritical marks and that creates a
                                         > difficulty. I tried Finereader and OmniPage but they
                                         > cannot be trained to recognize those symbols. I just
                                         > need an OCR program I can train to show any symbol
                                         > required, and the above programs cannot do that.
                                         >
                                         > Where should I start from? I feel like this program
                                         > can do the job, but can you help me to get started? I
                                         > downloaded tesseract and installed it (windows). There
                                         > are different GUIs available and I think it will make
                                         > it easier to work. Can you suggest a good one? I tried
                                         > gimagereader but it's too primitive and leaves a lot
                                         > of work to be done afterwards with the overall text.
                                         >
                                         > I don't think this kind of language pack is available,
                                         > so how do I create it?
                                         >
                                         > I will add one pdf and the fonts that were used to
                                         > create it. Maybe someone would like to try and let me
                                         > know how to do it?
                                         >
                                         > Thank you for any help!
                                         >
                                         > Regards,
                                         > Srivas


--
--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
