I'm a little new to all this. How do you run sed under Windows 7? I have read that it can also be run under Windows, but I cannot work out how to do it.
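If installing sed on Windows turns out to be a hurdle, the same character replacement can probably be done with a short Python script instead. The sketch below is only an illustration under assumptions: it swaps sed for Python, it assumes the extracted text is saved as UTF-8, and the names ROMAN_MAP, convert and convert_file are made up for this example. The mapping itself is copied from the substitutions quoted further down in this thread and, like roman.sed, it only covers the first few characters.

```python
# Sketch only: a Python stand-in for "sed -f roman.sed inputfile.txt > outputfile.txt".
# ROMAN_MAP, convert and convert_file are hypothetical names, not part of any tool.

# Pairs copied from the sed substitutions quoted in this thread (s/Å/Ā/g, ...).
ROMAN_MAP = {
    "Å": "Ā",
    "å": "ā",
    "®": "ṛ",
    "ß": "ṣ",
    "∫": "ṇ",
    "î": "ī",
    "Ê": "Ī",
    "¸": "Ś",
    "Ω": "ś",
    "ü": "ū",
}


def convert(text, mapping=ROMAN_MAP):
    """Apply each replacement in turn, like one s/old/new/g line per entry."""
    for old, new in mapping.items():
        text = text.replace(old, new)
    return text


def convert_file(src_path, dst_path, encoding="utf-8"):
    """Read src_path, convert its text, and write the result to dst_path.

    Assumes both files use the same encoding (UTF-8 unless stated otherwise).
    """
    with open(src_path, "r", encoding=encoding) as src:
        text = src.read()
    with open(dst_path, "w", encoding=encoding) as dst:
        dst.write(convert(text))
```

Calling convert_file("inputfile.txt", "outputfile.txt") would mirror the sed command line; further characters can be added to ROMAN_MAP as they turn up.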
On Wednesday, November 27, 2013 9:11:01 PM UTC+7, shree wrote:
>
> sed -f roman.sed inputfile.txt > outputfile.txt
>
> You will have to add other substitutions to the file roman.sed - it only
> has the first few substitutions that I encountered.
>
> Shree Devi Kumar
> ____________________________________________________________
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
>
> On Wed, Nov 27, 2013 at 7:08 PM, Jaanus Henno <[email protected]> wrote:
>
>> Thank you both for your help. This letter replacement is a good idea!
>> Looks like this sed script will do the work. I will just have to see how to
>> use sed... Tomorrow I will check it out.
>>
>>
>> On Wed, Nov 27, 2013 at 8:20 PM, Shree Devi Kumar <[email protected]> wrote:
>>
>>> I think rather than try to OCR, please extract the text and then run a
>>> conversion script to change the letters with diacritical marks.
>>>
>>> eg. you would do the following substitution using sed for the sample
>>> text from page 11
>>>
>>> s/Å/Ā/g
>>> s/å/ā/g
>>> s/®/ṛ/g
>>> s/ß/ṣ/g
>>> s/∫/ṇ/g
>>> s/î/ī/g
>>> s/Ê/Ī/g
>>> s/¸/Ś/g
>>> s/Ω/ś/g
>>> s/ü/ū/g
>>>
>>> Also attaching sed script as a utf-8 text file.
>>>
>>> Shree Devi Kumar
>>> ____________________________________________________________
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>
>>>
>>> On Wed, Nov 27, 2013 at 3:45 PM, V S Rawat <[email protected]> wrote:
>>>
>>>> those Ā á character are defined in Garamond font, but the ASCII code
>>>> used in this document is not the same as defined in Garamond font.
>>>>
>>>> So, it is some other font where these ASCII codes have been defined for
>>>> this character.
>>>>
>>>> The document list a dozen fonts, some of it might be that. you need to
>>>> figure out which font it could be, by hammer hit trial error method.
>>>>
>>>> Thanks.
>>>> --
>>>> Rawat
>>>>
>>>>
>>>> On 11/27/2013 3:17 PM, Jaanus Henno wrote:
>>>>
>>>>> Ok, you can try page 11. There is glossary and lots of words with
>>>>> diacritics. Thanks.
>>>>>
>>>>>
>>>>> On Wed, Nov 27, 2013 at 4:41 PM, V S Rawat <[email protected]> wrote:
>>>>>
>>>>> "words with sanskrit transliteration marks are used"
>>>>>
>>>>> could you please point out exact pages where to look for it. I will
>>>>> try to ocr it and see the results.
>>>>>
>>>>> Also,
>>>>> http://www.omkarananda-ashram.org/Sanskrit/itranslator99.htm#downloads
>>>>>
>>>>> The above page and several links from that page also have a lot of
>>>>> Sanskrit fonts. Maybe someone might be used by you.
>>>>>
>>>>> Thanks.
>>>>> --
>>>>> Rawat
>>>>>
>>>>>
>>>>> On 11/27/2013 9:16 AM, Srivas wrote:
>>>>>
>>>>> Hi Rawat!
>>>>>
>>>>> I'm really sorry, I didn't know that this is a mailing list type of
>>>>> forum ;-(
>>>>>
>>>>> Second, if you look carefully, you will see that the text is not
>>>>> entirely english. In many places words with sanskrit transliteration
>>>>> marks are used. But as you said, it can actually copy/pasted and it
>>>>> didn't even come to my mind! So this part is actually working and that
>>>>> is great! So I am almost there. The remaining problem is another type.
>>>>> The provided tamalten font will display the marks, but I need to use
>>>>> another font to display the final document. It also contains the same
>>>>> diacritical marks but uses another encoding. But this might be a
>>>>> question to another person, I know the author of the fonts, I will ask
>>>>> him. Thanks for the help!
>>>>>
>>>>> Btw. If anyone needs to use sanskrit transliterated fonts, here are the
>>>>> resources: http://www.krishna-das.com/ksyberspace/fonts/
>>>>>
>>>>> On Tuesday, November 26, 2013 4:47:11 PM UTC+7, V S Rawat wrote:
>>>>>
>>>>> Dear Sir Srivas ji,
>>>>>
>>>>> firstly, you should not have sent 2.2 MB 68 page pdf file and 181 KB
>>>>> zip to all the list members unasked. You could have loaded it
>>>>> somewhere and sent the link so that only those download it who can
>>>>> contribute in it. It is a wastage of time and bandwidth to get such
>>>>> huge messages.
>>>>>
>>>>> Secondly, I couldn't really understand your issue. I saw your pdf file.
>>>>> it is pure English. You can open it in any pdf reader and just copy
>>>>> entire text from there and paste in a text or word file. So, what else
>>>>> exactly you are looking for, please elaborate.
>>>>>
>>>>> you don't even need to ocr it. These are already ASCII text.
>>>>>
>>>>> Thanks.
>>>>> --
>>>>> Rawat
>>>>>
>>>>>
>>>>> On 11/26/2013 12:40 PM, Srivas wrote:
>>>>> > Hi!
>>>>> > I have a bunch of PDF files journals and I need to get the text out
>>>>> > of it. They contain a lot of romanized sanskrit diacritical marks and
>>>>> > that creates a difficulty. I tried Finereader and OmniPage but they
>>>>> > cannot be trained to recognize those symbols. I just need an ORC
>>>>> > program I can train to show any symbol required and the above
>>>>> > programs cannot do that.
>>>>> >
>>>>> > Where should I start from? I feel like this program can do the job
>>>>> > but can you help me to get started? I downloaded tesseract and
>>>>> > installed it (windows). There are different GUIs available and I
>>>>> > think it will make it easier to work. Can you suggest a good one? I
>>>>> > tried gimagereader but it's too primitive and leaves a lot of work to
>>>>> > be done afterwards with the overall text.
>>>>> >
>>>>> > I don't think this kind of language pack is available and how to
>>>>> > create it?
>>>>> >
>>>>> > I will add one pdf and fonts that were used to create it. Maybe
>>>>> > someone would like to try and let me know how to do it?
>>>>> >
>>>>> > Thank you for any help!
>>>>> >
>>>>> > Regards,
>>>>> > Srivas
>>>>>
>>>> --
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To post to this group, send email to [email protected]
>>>> To unsubscribe from this group, send email to [email protected]
>>>> For more options, visit this group at
>>>> http://groups.google.com/group/tesseract-ocr?hl=en
>>>>
>>>> ---
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>>
>>>> For more options, visit https://groups.google.com/groups/opt_out.

