Thank you both for your help. This letter replacement is a good idea! Looks like this sed script will do the work. I will just have to see how to use sed... Tomorrow I will check it out.
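In case sed turns out to be awkward on Windows, here is a minimal sketch of the same letter replacement in Python. The mapping is copied from the ten substitution rules in Shree's message; the names `DIACRITIC_MAP` and `fix_diacritics` and the sample word are illustrative assumptions, not part of the attached script. It also assumes the text has already been copied out of the PDF as a Unicode string.

```python
# Mapping copied from the sed rules in Shree's message: each key is the
# character as it appears in text copied from the PDF, each value is the
# intended transliteration character. (Names here are hypothetical.)
DIACRITIC_MAP = {
    "Å": "Ā",
    "å": "ā",
    "®": "ṛ",
    "ß": "ṣ",
    "∫": "ṇ",
    "î": "ī",
    "Ê": "Ī",
    "¸": "Ś",
    "Ω": "ś",
    "ü": "ū",
}


def fix_diacritics(text: str) -> str:
    """Apply every rule in DIACRITIC_MAP to the whole text, like s/old/new/g."""
    # Sequential replacement is safe here because no replacement character
    # also appears as a source character in the mapping.
    for wrong, right in DIACRITIC_MAP.items():
        text = text.replace(wrong, right)
    return text


if __name__ == "__main__":
    # Made-up sample word, only to show the call.
    print(fix_diacritics("K®ß∫a"))
```

Page 11 will probably not cover every character, so further pairs may need to be added to the mapping as other pages turn them up.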
On Wed, Nov 27, 2013 at 8:20 PM, Shree Devi Kumar <[email protected]> wrote:

> I think rather than try to OCR, please extract the text and then run a
> conversion script to change the letters with diacritical marks.
>
> E.g. you would do the following substitutions using sed for the sample text
> from page 11:
>
> s/Å/Ā/g
> s/å/ā/g
> s/®/ṛ/g
> s/ß/ṣ/g
> s/∫/ṇ/g
> s/î/ī/g
> s/Ê/Ī/g
> s/¸/Ś/g
> s/Ω/ś/g
> s/ü/ū/g
>
> Also attaching the sed script as a UTF-8 text file.
>
> Shree Devi Kumar
> ____________________________________________________________
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
>
> On Wed, Nov 27, 2013 at 3:45 PM, V S Rawat <[email protected]> wrote:
>
>> Those Ā á characters are defined in the Garamond font, but the character
>> codes used in this document are not the same as those defined in Garamond.
>>
>> So it is some other font in which these codes have been defined for
>> these characters.
>>
>> The document lists a dozen fonts; one of them might be it. You need to
>> figure out which font it could be by trial and error.
>>
>> Thanks.
>> --
>> Rawat
>>
>>
>> On 11/27/2013 3:17 PM, Jaanus Henno wrote:
>>
>>> Ok, you can try page 11. There is a glossary and lots of words with
>>> diacritics. Thanks.
>>>
>>>
>>> On Wed, Nov 27, 2013 at 4:41 PM, V S Rawat <[email protected]> wrote:
>>>
>>>     "words with sanskrit transliteration marks are used"
>>>
>>>     Could you please point out the exact pages where to look for it?
>>>     I will try to OCR it and see the results.
>>>
>>>     Also,
>>>     http://www.omkarananda-ashram.org/Sanskrit/itranslator99.htm#downloads
>>>
>>>     The above page and several links from that page also have a lot of
>>>     Sanskrit fonts. Maybe some of them might be of use to you.
>>>
>>>     Thanks.
>>>     --
>>>     Rawat
>>>
>>>
>>>     On 11/27/2013 9:16 AM, Srivas wrote:
>>>
>>>         Hi Rawat!
>>>
>>>         I'm really sorry, I didn't know that this is a mailing list
>>>         type of forum ;-(
>>>
>>>         Second, if you look carefully, you will see that the text is
>>>         not entirely English. In many places words with Sanskrit
>>>         transliteration marks are used. But as you said, it can
>>>         actually be copy/pasted, and that didn't even come to my mind!
>>>         So this part is actually working and that is great! So I am
>>>         almost there. The remaining problem is of another type. The
>>>         provided tamalten font will display the marks, but I need to
>>>         use another font to display the final document. It also
>>>         contains the same diacritical marks but uses another encoding.
>>>         But this might be a question for another person; I know the
>>>         author of the fonts, I will ask him. Thanks for the help!
>>>
>>>         Btw. if anyone needs to use Sanskrit transliterated fonts,
>>>         here are the resources:
>>>         http://www.krishna-das.com/ksyberspace/fonts/
>>>
>>>         On Tuesday, November 26, 2013 4:47:11 PM UTC+7, V S Rawat wrote:
>>>
>>>             Dear Sir Srivas ji,
>>>
>>>             Firstly, you should not have sent a 2.2 MB 68-page pdf
>>>             file and a 181 KB zip to all the list members unasked. You
>>>             could have uploaded it somewhere and sent the link so that
>>>             only those who can contribute download it. It is a waste
>>>             of time and bandwidth to get such huge messages.
>>>
>>>             Secondly, I couldn't really understand your issue. I saw
>>>             your pdf file. It is pure English. You can open it in any
>>>             pdf reader and just copy the entire text from there and
>>>             paste it into a text or Word file. So what else exactly
>>>             are you looking for? Please elaborate.
>>>
>>>             You don't even need to OCR it. These are already ASCII
>>>             text.
>>>
>>>             Thanks.
>>>             --
>>>             Rawat
>>>
>>>
>>>             On 11/26/2013 12:40 PM, Srivas wrote:
>>>             > Hi!
>>>             > I have a bunch of PDF journal files and I need to get
>>>             > the text out of them. They contain a lot of romanized
>>>             > Sanskrit diacritical marks and that creates a
>>>             > difficulty. I tried Finereader and OmniPage but they
>>>             > cannot be trained to recognize those symbols. I just
>>>             > need an OCR program I can train to recognize any symbol
>>>             > required, and the above programs cannot do that.
>>>             >
>>>             > Where should I start from? I feel like this program can
>>>             > do the job, but can you help me to get started? I
>>>             > downloaded tesseract and installed it (Windows). There
>>>             > are different GUIs available and I think one will make
>>>             > it easier to work. Can you suggest a good one? I tried
>>>             > gimagereader but it's too primitive and leaves a lot of
>>>             > work to be done afterwards with the overall text.
>>>             >
>>>             > I don't think this kind of language pack is available,
>>>             > so how do I create it?
>>>             >
>>>             > I will add one pdf and the fonts that were used to
>>>             > create it. Maybe someone would like to try and let me
>>>             > know how to do it?
>>>             >
>>>             > Thank you for any help!
>>>             >
>>>             > Regards,
>>>             > Srivas

--
--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en

---
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
For more options, visit https://groups.google.com/groups/opt_out.

