sed -f roman.sed inputfile.txt > outputfile.txt

You will have to add other substitutions to the file roman.sed; it only has the first few substitutions that I encountered.
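In case it helps anyone new to sed, here is a minimal sketch of the whole round trip. It assumes roman.sed contains only the ten substitutions quoted further down in this thread, and the sample word in inputfile.txt is made up for illustration; it is not taken from the PDF.

```shell
# Illustrative roman.sed built from the substitutions quoted in this thread;
# the real file would need further rules for the remaining characters.
cat > roman.sed <<'EOF'
s/Å/Ā/g
s/å/ā/g
s/®/ṛ/g
s/ß/ṣ/g
s/∫/ṇ/g
s/î/ī/g
s/Ê/Ī/g
s/¸/Ś/g
s/Ω/ś/g
s/ü/ū/g
EOF

# Hypothetical one-line input, standing in for text copied out of the PDF
printf '%s\n' 'K®ß∫a' > inputfile.txt

# Apply every rule in roman.sed to each line of the input
sed -f roman.sed inputfile.txt > outputfile.txt
cat outputfile.txt   # prints: Kṛṣṇa
```

Each line in roman.sed is one s/old/new/g rule, so adding a character means adding one more line of the same shape. Save the file as UTF-8, otherwise the characters on either side of the rule will not match the text.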
Shree Devi Kumar
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Wed, Nov 27, 2013 at 7:08 PM, Jaanus Henno <[email protected]> wrote:

> Thank you both for your help. This letter replacement is a good idea!
> Looks like this sed script will do the work. I will just have to see how to
> use sed... Tomorrow I will check it out.
>
>
> On Wed, Nov 27, 2013 at 8:20 PM, Shree Devi Kumar <[email protected]> wrote:
>
>> I think rather than try to OCR, please extract the text and then run a
>> conversion script to change the letters with diacritical marks.
>>
>> e.g. you would do the following substitutions using sed for the sample text
>> from page 11:
>>
>> s/Å/Ā/g
>> s/å/ā/g
>> s/®/ṛ/g
>> s/ß/ṣ/g
>> s/∫/ṇ/g
>> s/î/ī/g
>> s/Ê/Ī/g
>> s/¸/Ś/g
>> s/Ω/ś/g
>> s/ü/ū/g
>>
>> Also attaching sed script as a utf-8 text file.
>>
>> Shree Devi Kumar
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>>
>> On Wed, Nov 27, 2013 at 3:45 PM, V S Rawat <[email protected]> wrote:
>>
>>> Those Ā á characters are defined in the Garamond font, but the ASCII codes
>>> used in this document are not the same as those defined in the Garamond
>>> font.
>>>
>>> So it must be some other font in which these ASCII codes are defined for
>>> these characters.
>>>
>>> The document lists a dozen fonts; it might be one of them. You need to
>>> figure out which font it could be by trial and error.
>>>
>>> Thanks.
>>> --
>>> Rawat
>>>
>>>
>>> On 11/27/2013 3:17 PM, Jaanus Henno wrote:
>>>
>>>> Ok, you can try page 11. There is a glossary and lots of words with
>>>> diacritics. Thanks.
>>>>
>>>>
>>>> On Wed, Nov 27, 2013 at 4:41 PM, V S Rawat <[email protected]> wrote:
>>>>
>>>> "words with sanskrit transliteration marks are used"
>>>>
>>>> Could you please point out the exact pages where to look for it? I will
>>>> try to OCR it and see the results.
>>>>
>>>> Also,
>>>> http://www.omkarananda-ashram.org/Sanskrit/itranslator99.htm#downloads
>>>>
>>>> The above page and several links from that page also have a lot of
>>>> Sanskrit fonts. Maybe one of them will be of use to you.
>>>>
>>>> Thanks.
>>>> --
>>>> Rawat
>>>>
>>>>
>>>> On 11/27/2013 9:16 AM, Srivas wrote:
>>>>
>>>> Hi Rawat!
>>>>
>>>> I'm really sorry, I didn't know that this is a mailing-list type of
>>>> forum ;-(
>>>>
>>>> Second, if you look carefully, you will see that the text is not
>>>> entirely English. In many places words with Sanskrit transliteration
>>>> marks are used. But as you said, it can actually be copied and pasted,
>>>> and that didn't even come to my mind! So this part is actually working,
>>>> and that is great! So I am almost there. The remaining problem is of
>>>> another type. The provided tamalten font will display the marks, but I
>>>> need to use another font to display the final document. It also contains
>>>> the same diacritical marks but uses another encoding. But this might be
>>>> a question for another person; I know the author of the fonts, I will
>>>> ask him. Thanks for the help!
>>>>
>>>> Btw, if anyone needs to use Sanskrit transliteration fonts, here are the
>>>> resources: http://www.krishna-das.com/ksyberspace/fonts/
>>>>
>>>> On Tuesday, November 26, 2013 4:47:11 PM UTC+7, V S Rawat wrote:
>>>>
>>>> Dear Sir Srivas ji,
>>>>
>>>> Firstly, you should not have sent a 2.2 MB 68-page pdf file and a 181 KB
>>>> zip to all the list members unasked. You could have uploaded them
>>>> somewhere and sent the link, so that only those who can contribute would
>>>> download them. It is a waste of time and bandwidth to get such huge
>>>> messages.
>>>>
>>>> Secondly, I couldn't really understand your issue. I saw your pdf file.
>>>> It is pure English. You can open it in any pdf reader and just copy the
>>>> entire text from there and paste it into a text or Word file. So what
>>>> else exactly are you looking for? Please elaborate.
>>>>
>>>> You don't even need to OCR it. It is already ASCII text.
>>>>
>>>> Thanks.
>>>> --
>>>> Rawat
>>>>
>>>>
>>>> On 11/26/2013 12:40 PM, Srivas wrote:
>>>> > Hi!
>>>> > I have a bunch of PDF journals and I need to get the text out of
>>>> > them. They contain a lot of romanized Sanskrit diacritical marks, and
>>>> > that creates a difficulty. I tried FineReader and OmniPage, but they
>>>> > cannot be trained to recognize those symbols. I just need an OCR
>>>> > program I can train to show any symbol required, and the above
>>>> > programs cannot do that.
>>>> >
>>>> > Where should I start? I feel like this program can do the job, but
>>>> > can you help me get started? I downloaded tesseract and installed it
>>>> > (Windows). There are different GUIs available, and I think one will
>>>> > make it easier to work. Can you suggest a good one? I tried
>>>> > gimagereader, but it's too primitive and leaves a lot of work to be
>>>> > done afterwards with the overall text.
>>>> >
>>>> > I don't think this kind of language pack is available; how would I
>>>> > create it?
>>>> >
>>>> > I will add one pdf and the fonts that were used to create it. Maybe
>>>> > someone would like to try and let me know how to do it?
>>>> >
>>>> > Thank you for any help!
>>>> >
>>>> > Regards,
>>>> > Srivas
>>>>
>>>>
>>> --
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To post to this group, send email to [email protected]
>>> To unsubscribe from this group, send email to
>>> [email protected]
>>> For more options, visit this group at
>>> http://groups.google.com/group/tesseract-ocr?hl=en
>>>
>>> ---
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>>
>>> For more options, visit https://groups.google.com/groups/opt_out.

