Re: Need help in recognizing english texts with sanskrit roman diacritical marks.

Shree Devi Kumar Wed, 27 Nov 2013 05:21:53 -0800

Rather than trying to OCR the pages, I suggest extracting the text from the
PDF and then running a conversion script to replace the wrongly encoded
characters with the proper letters with diacritical marks.


E.g., for the sample text from page 11 you could apply the following
substitutions using sed:

s/Å/Ā/g
s/å/ā/g
s/®/ṛ/g
s/ß/ṣ/g
s/∫/ṇ/g
s/î/ī/g
s/Ê/Ī/g
s/¸/Ś/g
s/Ω/ś/g
s/ü/ū/g

I am also attaching the sed script as a UTF-8 text file.
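
In case sed is not handy on Windows, the same substitutions can be sketched
in Python. This is only a rough equivalent of the list above: the function
names are made up for illustration, and the table covers only the ten
characters seen on page 11, so other pages may need more entries.

```python
# Rough Python equivalent of the sed substitutions above.
# The pairs are copied from the sed list; the names are illustrative only.
LEGACY_TO_UNICODE = str.maketrans({
    "Å": "Ā",
    "å": "ā",
    "®": "ṛ",
    "ß": "ṣ",
    "∫": "ṇ",
    "î": "ī",
    "Ê": "Ī",
    "¸": "Ś",
    "Ω": "ś",
    "ü": "ū",
})


def fix_diacritics(text: str) -> str:
    """Replace each legacy font character with the intended Unicode letter."""
    return text.translate(LEGACY_TO_UNICODE)


def unmapped_non_ascii(raw_text: str) -> set:
    """Return non-ASCII characters in the raw extracted text that the table
    does not cover yet; these are candidates for new entries."""
    return {
        ch for ch in raw_text
        if ord(ch) > 127 and ord(ch) not in LEGACY_TO_UNICODE
    }
```

To use it, read the extracted text as UTF-8, pass it through
fix_diacritics, and write the result back out as UTF-8. Running
unmapped_non_ascii on the raw text first shows which characters still need
a mapping.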

Shree Devi Kumar
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


On Wed, Nov 27, 2013 at 3:45 PM, V S Rawat <[email protected]> wrote:

> Those Ā á characters are defined in the Garamond font, but the character
> codes used in this document are not the same as those defined in Garamond.
>
> So, it is some other font in which these codes have been defined for
> these characters.
>
> The document lists a dozen fonts; it might be one of those. You need to
> figure out which font it could be by trial and error.
>
> Thanks.
> --
> Rawat
>
>
> On 11/27/2013 3:17 PM, Jaanus Henno wrote:
>
>> Ok, you can try page 11. There is glossary and lots of words with
>> diacritics. Thanks.
>>
>>
>> On Wed, Nov 27, 2013 at 4:41 PM, V S Rawat <[email protected]> wrote:
>>
>>
>>     "words with sanskrit transliteration marks are used"
>>
>>     could you please point out exact pages where to look for it. I will
>>     try to ocr it and see the results.
>>
>>     Also,
>>     http://www.omkarananda-ashram.org/Sanskrit/itranslator99.htm#downloads
>>
>>     The above page and several links from that page also have a lot of
>>     Sanskrit fonts. Maybe one of them will be of use to you.
>>
>>     Thanks.
>>     --
>>     Rawat
>>
>>
>>     On 11/27/2013 9:16 AM, Srivas wrote:
>>
>>         Hi Rawat!
>>
>>         I'm really sorry, I didn't know that this is a mailing-list type
>>         of forum ;-(
>>
>>         Second, if you look carefully, you will see that the text is not
>>         entirely English. In many places words with Sanskrit
>>         transliteration marks are used. But as you said, it can actually
>>         be copy/pasted, and that didn't even come to my mind! So this
>>         part is actually working, and that is great! So I am almost
>>         there. The remaining problem is of another type. The provided
>>         tamalten font will display the marks, but I need to use another
>>         font to display the final document. It also contains the same
>>         diacritical marks but uses another encoding. But this might be a
>>         question for another person; I know the author of the fonts, and
>>         I will ask him. Thanks for the help!
>>
>>         Btw. If anyone needs to use sanskrit transliterated fonts, here
>>         are the resources:
>>         http://www.krishna-das.com/ksyberspace/fonts/
>>
>>         On Tuesday, November 26, 2013 4:47:11 PM UTC+7, V S Rawat wrote:
>>
>>              Dear Sir Srivas ji,
>>
>>              Firstly, you should not have sent a 2.2 MB, 68-page PDF
>>              file and a 181 KB zip to all the list members unasked. You
>>              could have uploaded them somewhere and sent the link, so
>>              that only those who can contribute would download them. It
>>              is a wastage of time and bandwidth to get such huge
>>              messages.
>>
>>              Secondly, I couldn't really understand your issue. I saw
>>              your PDF file. It is pure English. You can open it in any
>>              PDF reader, copy the entire text from there and paste it
>>              into a text or Word file. So, what else exactly are you
>>              looking for? Please elaborate.
>>
>>              You don't even need to OCR it. This is already ASCII text.
>>
>>              Thanks.
>>              --
>>              Rawat
>>
>>
>>              On 11/26/2013 12:40 PM, Srivas wrote:
>>               > Hi!
>>               > I have a bunch of PDF journal files and I need to get the
>>               > text out of them. They contain a lot of romanized Sanskrit
>>               > diacritical marks and that creates a difficulty. I tried
>>               > FineReader and OmniPage but they cannot be trained to
>>               > recognize those symbols. I just need an OCR program I can
>>               > train to show any symbol required, and the above programs
>>               > cannot do that.
>>               >
>>               > Where should I start from? I feel like this program can do
>>               > the job, but can you help me to get started? I downloaded
>>               > tesseract and installed it (Windows). There are different
>>               > GUIs available and I think one will make it easier to
>>               > work. Can you suggest a good one? I tried gimagereader but
>>               > it's too primitive and leaves a lot of work to be done
>>               > afterwards with the overall text.
>>               >
>>               > I don't think this kind of language pack is available, so
>>               > how do I create it?
>>               >
>>               > I will add one PDF and the fonts that were used to create
>>               > it. Maybe someone would like to try and let me know how to
>>               > do it?
>>               >
>>               > Thank you for any help!
>>               >
>>               > Regards,
>>               > Srivas
>>
>>
> --
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to [email protected]
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en
>
> --- You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/groups/opt_out.
>

roman.sed
Description: Binary data
