Re: Need help in recognizing english texts with sanskrit roman diacritical marks.

V S Rawat Wed, 27 Nov 2013 08:00:10 -0800

That is a very convenient solution, Shree Devi ji.

However, if sed or other "substitutors" are not available, or if one wants to avoid using them, I think it can be done using the built-in post-processing method of tesseract.

Use san.DangAmbigs.txt or hin.DangAmbigs.txt, whichever language you are using.


Then put the substitutions in it as
Å=Ā
one per line.

Would it work equally well and automatically, without needing a manual step?

If so, then, Shree Devi ji, is there any major benefit to post-processing with sed?


Please remind me where this DangAmbigs file should be placed.

Thanks.
--
Rawat

On 11/27/2013 6:50 PM, Shree Devi Kumar wrote:

I think rather than try to OCR, please extract the text and then run a
conversion script to change the letters with diacritical marks.

e.g., you would do the following substitutions using sed for the sample
text from page 11

s/Å/Ā/g
s/å/ā/g
s/®/ṛ/g
s/ß/ṣ/g
s/∫/ṇ/g
s/î/ī/g
s/Ê/Ī/g
s/¸/Ś/g
s/Ω/ś/g
s/ü/ū/g

Also attaching sed script as a utf-8 text file.
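
For anyone without sed, the same substitutions can be sketched in Python. This is only a minimal sketch and not the attached script: the names LEGACY_TO_UNICODE and fix_diacritics are hypothetical, and the table simply copies the ten pairs listed above, so it would need extending for any other characters the font remaps.

```python
# Mapping copied from the ten sed substitutions listed above.
# Keys are the characters that come out when the text is copied from the
# PDF; values are the intended transliteration characters.
LEGACY_TO_UNICODE = {
    "Å": "Ā",
    "å": "ā",
    "®": "ṛ",
    "ß": "ṣ",
    "∫": "ṇ",
    "î": "ī",
    "Ê": "Ī",
    "¸": "Ś",
    "Ω": "ś",
    "ü": "ū",
}


def fix_diacritics(text: str) -> str:
    """Apply each substitution in turn, like a sed script of s/old/new/g lines."""
    for old, new in LEGACY_TO_UNICODE.items():
        text = text.replace(old, new)
    return text


if __name__ == "__main__":
    # Hypothetical sample string, typed as it would appear after copy/paste.
    print(fix_diacritics("K®ß∫a"))
```

None of the replacement characters also appears as a character to be replaced, so the order of the substitutions does not matter here.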

Shree Devi Kumar
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


On Wed, Nov 27, 2013 at 3:45 PM, V S Rawat <[email protected]> wrote:

    Those Ā á characters are defined in the Garamond font, but the codes
    used in this document are not the same as those defined in the
    Garamond font.

    So it must be some other font in which these codes have been defined
    for these characters.

    The document lists a dozen fonts; one of them might be it. You need
    to figure out which font it is by trial and error.

    Thanks.
    --
    Rawat


    On 11/27/2013 3:17 PM, Jaanus Henno wrote:

        Ok, you can try page 11. There is glossary and lots of words with
        diacritics. Thanks.


        On Wed, Nov 27, 2013 at 4:41 PM, V S Rawat <[email protected]> wrote:


             "words with sanskrit transliteration marks are used"

             Could you please point out the exact pages where to look for
             it? I will try to OCR it and see the results.

             Also,
             http://www.omkarananda-ashram.org/Sanskrit/itranslator99.htm#downloads

             The above page and several links from that page also have a
             lot of Sanskrit fonts. Maybe some of them might be useful to
             you.

             Thanks.
             --
             Rawat


             On 11/27/2013 9:16 AM, Srivas wrote:

                 Hi Rawat!

                 I'm really sorry, I didn't know that this is a mailing list
                 type of forum ;-(

                 Second, if you look carefully, you will see that the text is
                 not entirely English. In many places words with Sanskrit
                 transliteration marks are used. But as you said, it can
                 actually be copied and pasted, and that didn't even come to
                 my mind! So this part is actually working, and that is great!
                 So I am almost there. The remaining problem is of another
                 type. The provided tamalten font will display the marks, but
                 I need to use another font to display the final document. It
                 also contains the same diacritical marks but uses another
                 encoding. But this might be a question for another person. I
                 know the author of the fonts; I will ask him. Thanks for the
                 help!

                 Btw. If anyone needs to use Sanskrit transliterated fonts,
                 here are the resources:
                 http://www.krishna-das.com/ksyberspace/fonts/

                 On Tuesday, November 26, 2013 4:47:11 PM UTC+7, V S Rawat wrote:

                      Dear Sir Srivas ji,

                      Firstly, you should not have sent a 2.2 MB, 68-page pdf
                      file and a 181 KB zip to all the list members unasked.
                      You could have uploaded them somewhere and sent the
                      link, so that only those who can contribute would
                      download them. It is a waste of time and bandwidth to
                      get such huge messages.

                      Secondly, I couldn't really understand your issue. I saw
                      your pdf file. It is pure English. You can open it in
                      any pdf reader, copy the entire text from there, and
                      paste it into a text or Word file. So what else exactly
                      are you looking for? Please elaborate.

                      You don't even need to OCR it. This is already ASCII
                      text.

                      Thanks.
                      --
                      Rawat


                      On 11/26/2013 12:40 PM, Srivas wrote:
                       > Hi!
                       > I have a bunch of PDF journal files and I need to get
                       > the text out of them. They contain a lot of romanized
                       > Sanskrit diacritical marks, and that creates a
                       > difficulty. I tried FineReader and OmniPage, but they
                       > cannot be trained to recognize those symbols. I just
                       > need an OCR program I can train to show any symbol
                       > required, and the above programs cannot do that.
                       >
                       > Where should I start? I feel like this program can do
                       > the job, but can you help me get started? I downloaded
                       > tesseract and installed it (Windows). There are
                       > different GUIs available and I think one will make it
                       > easier to work. Can you suggest a good one? I tried
                       > gimagereader but it's too primitive and leaves a lot
                       > of work to be done afterwards with the overall text.
                       >
                       > I don't think this kind of language pack is available,
                       > so how do I create it?
                       >
                       > I will add one pdf and the fonts that were used to
                       > create it. Maybe someone would like to try and let me
                       > know how to do it?
                       >
                       > Thank you for any help!
                       >
                       > Regards,
                       > Srivas


--
--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
