Re: Need help in recognizing english texts with sanskrit roman diacritical marks.

V S Rawat Wed, 27 Nov 2013 09:54:58 -0800

Yes, the text in Srivas ji's file is 100% text, not images, and is 100% extractable to a Word or text file by simple copy and paste. OCR is just not needed.

Then it is good that sed will make the changes without any need for OCR. Good thought.

I use vim on Windows 8, so I wouldn't downgrade to sed. He he, just kidding. vim has sed-style substitution built in. :-)


Thanks.
--
Rawat

On 11/27/2013 9:50 PM, Shree Devi Kumar wrote:

Rawatji,

I was going by the assumption that the text can be easily extracted from
his PDF by saving it as txt. In that case, just running the sed script will
fix the letters with diacritics that were mapped to some other letters
in the ASCII font.

OCR never gives a 100% correct result, so using the OCR output and
postprocessing it may not be the best solution in this case.

You could try the Windows version of sed from
http://gnuwin32.sourceforge.net/packages/sed.htm

I only tested using one paragraph of text from page 11.

Shree

Shree Devi Kumar
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


On Wed, Nov 27, 2013 at 9:29 PM, V S Rawat <[email protected]> wrote:

    That is a very convenient solution, Shree Devi ji.

    However, if sed or other "substitutors" are not there, or if one
    wants to avoid using them, I think it can be done using the built-in
    post-processing method of tesseract.

    Use san.DangAmbigs.txt or hin.DangAmbigs.txt, whichever language you
    are using.

    Then put the substitutions in it as
    Å=Ā
    one per line.

    Would it work equally well and automatically, without needing a
    manual step?

    If so, then, Shree Devi ji, is there any major benefit to
    post-processing in sed?

    Please remind me where this DangAmbigs file should be put.

    Thanks.
    --
    Rawat


    On 11/27/2013 6:50 PM, Shree Devi Kumar wrote:

        I think rather than trying to OCR, please extract the text and
        then run a conversion script to change the letters with
        diacritical marks.

        E.g., you would do the following substitutions using sed for the
        sample text from page 11:

        s/Å/Ā/g
        s/å/ā/g
        s/®/ṛ/g
        s/ß/ṣ/g
        s/∫/ṇ/g
        s/î/ī/g
        s/Ê/Ī/g
        s/¸/Ś/g
        s/Ω/ś/g
        s/ü/ū/g

        Also attaching sed script as a utf-8 text file.
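
For readers without sed, the same substitutions can be sketched in Python. This is only an illustration under assumptions: the mapping is copied from the ten sed rules above (which were tested on one paragraph of page 11 only), the function name `fix_diacritics` is hypothetical, and the sample word in the usage note is made up rather than taken from the PDF.

```python
# Legacy character -> Unicode transliteration character, copied from the
# sed rules above. The list is probably incomplete for the whole document.
SUBSTITUTIONS = [
    ("Å", "Ā"),
    ("å", "ā"),
    ("®", "ṛ"),
    ("ß", "ṣ"),
    ("∫", "ṇ"),
    ("î", "ī"),
    ("Ê", "Ī"),
    ("¸", "Ś"),
    ("Ω", "ś"),
    ("ü", "ū"),
]


def fix_diacritics(text):
    """Apply each substitution globally, in order, like sed's s/old/new/g."""
    for old, new in SUBSTITUTIONS:
        text = text.replace(old, new)
    return text


if __name__ == "__main__":
    # Hypothetical sample input, not taken from the actual PDF.
    print(fix_diacritics("K®ß∫a"))
```

To convert a whole file, read it with `open(path, encoding="utf-8")`, pass the contents through `fix_diacritics`, and write the result back out as UTF-8. None of the replacement characters appears as a source character in the list, so the order of the rules does not matter here.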

        Shree Devi Kumar
        ______________________________________________________________
        भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


        On Wed, Nov 27, 2013 at 3:45 PM, V S Rawat <[email protected]> wrote:

             Those Ā á characters are defined in the Garamond font, but
             the character codes used in this document are not the same
             as those defined in Garamond.

             So it must be some other font in which these codes have
             been assigned to these characters.

             The document lists a dozen fonts; one of them might be it.
             You need to figure out which font it is by trial and error.
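
One way to make the trial and error less blind is to list which non-ASCII characters actually occur in the extracted text, so that each one can be checked against the candidate fonts. The following is a minimal sketch; the function names `non_ascii_counts` and `report` are hypothetical, and it assumes the text has already been extracted from the PDF into a Python string.

```python
from collections import Counter


def non_ascii_counts(text):
    """Count every character above the 7-bit ASCII range."""
    return Counter(ch for ch in text if ord(ch) > 127)


def report(text):
    """Return one line per non-ASCII character, most frequent first."""
    return [
        f"U+{ord(ch):04X} {ch!r} x{count}"
        for ch, count in non_ascii_counts(text).most_common()
    ]


if __name__ == "__main__":
    # Hypothetical sample; real input would be the text saved from the PDF.
    for line in report("abc \u00e5\u00e5 \u00c5"):
        print(line)
```

Each code point in the report is a candidate for a substitution rule once the glyph it stands for in the original font is known.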

             Thanks.
             --
             Rawat


             On 11/27/2013 3:17 PM, Jaanus Henno wrote:

                 OK, you can try page 11. There is a glossary and lots of
                 words with diacritics. Thanks.


                 On Wed, Nov 27, 2013 at 4:41 PM, V S Rawat <[email protected]> wrote:


                      "words with sanskrit transliteration marks are used"

                      Could you please point out the exact pages where I
                      should look for it? I will try to OCR it and see
                      the results.

                      Also,
                      http://www.omkarananda-ashram.org/Sanskrit/itranslator99.htm#downloads

                      The above page and several links from that page
                      also have a lot of Sanskrit fonts. Maybe one of
                      them might be of use to you.

                      Thanks.
                      --
                      Rawat


                      On 11/27/2013 9:16 AM, Srivas wrote:

                          Hi Rawat!

                          I'm really sorry, I didn't know that this is a
                          mailing list type of forum ;-(

                          Second, if you look carefully, you will see
                          that the text is not entirely English. In many
                          places words with Sanskrit transliteration
                          marks are used. But as you said, it can
                          actually be copied and pasted, and that didn't
                          even come to my mind! So this part is actually
                          working, and that is great! So I am almost
                          there. The remaining problem is of another
                          type. The provided tamalten font will display
                          the marks, but I need to use another font to
                          display the final document. It also contains
                          the same diacritical marks but uses another
                          encoding. But this might be a question for
                          another person. I know the author of the
                          fonts, so I will ask him. Thanks for the help!

                          Btw., if anyone needs to use Sanskrit
                          transliteration fonts, here are the resources:
                          http://www.krishna-das.com/ksyberspace/fonts/

                          On Tuesday, November 26, 2013 4:47:11 PM UTC+7, V S Rawat wrote:

                               Dear Sir Srivas ji,

                               Firstly, you should not have sent a 2.2 MB,
                               68-page PDF file and a 181 KB zip to all
                               the list members unasked. You could have
                               uploaded it somewhere and sent the link,
                               so that only those who can contribute
                               download it. It is a waste of time and
                               bandwidth to get such huge messages.

                               Secondly, I couldn't really understand
                               your issue. I saw your PDF file. It is
                               pure English. You can open it in any PDF
                               reader, copy the entire text from there,
                               and paste it into a text or Word file. So
                               what else exactly are you looking for?
                               Please elaborate.

                               You don't even need to OCR it. This is
                               already ASCII text.

                               Thanks.
                               --
                               Rawat


                               On 11/26/2013 12:40 PM, Srivas wrote:
                                > Hi!
                                > I have a bunch of PDF journal files and I need
                                > to get the text out of them. They contain a lot
                                > of romanized Sanskrit diacritical marks, and
                                > that creates a difficulty. I tried FineReader
                                > and OmniPage, but they cannot be trained to
                                > recognize those symbols. I just need an OCR
                                > program I can train to show any symbol
                                > required, and the above programs cannot do
                                > that.
                                >
                                > Where should I start from? I feel like this
                                > program can do the job, but can you help me to
                                > get started? I downloaded tesseract and
                                > installed it (Windows). There are different
                                > GUIs available and I think one will make it
                                > easier to work. Can you suggest a good one? I
                                > tried gimagereader, but it's too primitive and
                                > leaves a lot of work to be done afterwards
                                > with the overall text.
                                >
                                > I don't think this kind of language pack is
                                > available, so how do I create it?
                                >
                                > I will add one PDF and the fonts that were
                                > used to create it. Maybe someone would like to
                                > try and let me know how to do it?
                                >
                                > Thank you for any help!
                                >
                                > Regards,
                                > Srivas


--
--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en

