Ok, somehow I missed the last part of the conversation. I will try out those Windows-based options you mentioned for sed.
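
For anyone without sed on Windows, the substitutions quoted below could also be applied with a short Python script. This is only a sketch under assumptions: the table holds just the ten rules listed in the thread (the real roman.sed attachment may contain more), the names `SUBSTITUTIONS`, `convert_text` and `convert_file` are made up for illustration, and the extracted text is assumed to be saved as UTF-8.

```python
# Hypothetical stand-in for roman.sed: each pair mirrors one s/old/new/g
# rule from the thread. Extend the list as more characters turn up.
SUBSTITUTIONS = [
    ("Å", "Ā"),
    ("å", "ā"),
    ("®", "ṛ"),
    ("ß", "ṣ"),
    ("∫", "ṇ"),
    ("î", "ī"),
    ("Ê", "Ī"),
    ("¸", "Ś"),
    ("Ω", "ś"),
    ("ü", "ū"),
]


def convert_text(text):
    """Apply every substitution in order, like consecutive sed s///g rules."""
    for old, new in SUBSTITUTIONS:
        text = text.replace(old, new)
    return text


def convert_file(src_path, dst_path):
    """Read src_path as UTF-8 (an assumption), convert it, write dst_path."""
    with open(src_path, encoding="utf-8") as src:
        converted = convert_text(src.read())
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(converted)
```

Calling `convert_file("inputfile.txt", "outputfile.txt")` would then play the role of `sed -f roman.sed inputfile.txt > outputfile.txt`. Plain `str.replace` is used instead of regular expressions because every rule here is a literal character swap.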
On Thu, Nov 28, 2013 at 10:13 AM, Srivas <[email protected]> wrote:

> I'm a little new to all that. How do you run sed under Windows 7? I read
> information about it and that it can also be run under windows but cannot
> understand how to do that.
>
>
> On Wednesday, November 27, 2013 9:11:01 PM UTC+7, shree wrote:
>
>> sed -f roman.sed inputfile.txt > outputfile.txt
>>
>> You will have to add other substitutions to the file roman.sed - it only
>> has the first few substitutions that I encountered.
>>
>> Shree Devi Kumar
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>>
>> On Wed, Nov 27, 2013 at 7:08 PM, Jaanus Henno <[email protected]> wrote:
>>
>>> Thank you both for your help. This letter replacement is a good idea!
>>> Looks like this sed script will do the work. I will just have to see how to
>>> use sed... Tomorrow I will check it out.
>>>
>>>
>>> On Wed, Nov 27, 2013 at 8:20 PM, Shree Devi Kumar <[email protected]> wrote:
>>>
>>>> I think rather than try to OCR, please extract the text and then run a
>>>> conversion script to change the letters with diacritical marks.
>>>>
>>>> eg. you would do the following substitution using sed for the sample
>>>> text from page 11
>>>>
>>>> s/Å/Ā/g
>>>> s/å/ā/g
>>>> s/®/ṛ/g
>>>> s/ß/ṣ/g
>>>> s/∫/ṇ/g
>>>> s/î/ī/g
>>>> s/Ê/Ī/g
>>>> s/¸/Ś/g
>>>> s/Ω/ś/g
>>>> s/ü/ū/g
>>>>
>>>> Also attaching sed script as a utf-8 text file.
>>>>
>>>> Shree Devi Kumar
>>>> ____________________________________________________________
>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>
>>>>
>>>> On Wed, Nov 27, 2013 at 3:45 PM, V S Rawat <[email protected]> wrote:
>>>>
>>>>> those Ā á character are defined in Garamond font, but the ASCII code
>>>>> used in this document is not the same as defined in Garamond font.
>>>>>
>>>>> So, it is some other font where these ASCII codes have been defined
>>>>> for this character.
>>>>>
>>>>> The document lists a dozen fonts; one of them might be it. You need to
>>>>> figure out which font it could be, by trial and error.
>>>>>
>>>>> Thanks.
>>>>> --
>>>>> Rawat
>>>>>
>>>>>
>>>>> On 11/27/2013 3:17 PM, Jaanus Henno wrote:
>>>>>
>>>>>> Ok, you can try page 11. There is a glossary and lots of words with
>>>>>> diacritics. Thanks.
>>>>>>
>>>>>>
>>>>>> On Wed, Nov 27, 2013 at 4:41 PM, V S Rawat <[email protected]> wrote:
>>>>>>
>>>>>>
>>>>>> "words with Sanskrit transliteration marks are used"
>>>>>>
>>>>>> Could you please point out the exact pages where to look for it? I will
>>>>>> try to OCR it and see the results.
>>>>>>
>>>>>> Also,
>>>>>> http://www.omkarananda-ashram.org/Sanskrit/itranslator99.htm#downloads
>>>>>>
>>>>>> The above page and several links from that page also have a lot of
>>>>>> Sanskrit fonts. Maybe one of them might be of use to you.
>>>>>>
>>>>>> Thanks.
>>>>>> --
>>>>>> Rawat
>>>>>>
>>>>>>
>>>>>> On 11/27/2013 9:16 AM, Srivas wrote:
>>>>>>
>>>>>> Hi Rawat!
>>>>>>
>>>>>> I'm really sorry, I didn't know that this is a mailing list type of
>>>>>> forum ;-(
>>>>>>
>>>>>> Second, if you look carefully, you will see that the text is not
>>>>>> entirely English. In many places words with Sanskrit transliteration
>>>>>> marks are used. But as you said, it can actually be copy/pasted and it
>>>>>> didn't even come to my mind! So this part is actually working and that
>>>>>> is great! So I am almost there. The remaining problem is of another
>>>>>> type. The provided tamalten font will display the marks, but I need to
>>>>>> use another font to display the final document. It also contains the
>>>>>> same diacritical marks but uses another encoding.
>>>>>> But this might be a question for another person. I know the author of
>>>>>> the fonts; I will ask him. Thanks for the help!
>>>>>>
>>>>>> Btw. If anyone needs to use Sanskrit transliterated fonts, here are the
>>>>>> resources: http://www.krishna-das.com/ksyberspace/fonts/
>>>>>>
>>>>>> On Tuesday, November 26, 2013 4:47:11 PM UTC+7, V S Rawat wrote:
>>>>>>
>>>>>> Dear Sir Srivas ji,
>>>>>>
>>>>>> firstly, you should not have sent 2.2 MB 68 page pdf file and 181 KB
>>>>>> zip to all the list members unasked. You could have loaded it somewhere
>>>>>> and sent the link so that only those download it who can contribute in
>>>>>> it. It is a wastage of time and bandwidth to get such huge messages.
>>>>>>
>>>>>> Secondly, I couldn't really understand your issue. I saw your pdf file.
>>>>>> it is pure English. You can open it in any pdf reader and just copy
>>>>>> entire text from there and paste in a text or word file. So, what else
>>>>>> exactly you are looking for, please elaborate.
>>>>>>
>>>>>> you don't even need to ocr it. These are already ASCII text.
>>>>>>
>>>>>> Thanks.
>>>>>> --
>>>>>> Rawat
>>>>>>
>>>>>>
>>>>>> On 11/26/2013 12:40 PM, Srivas wrote:
>>>>>> > Hi!
>>>>>> > I have a bunch of PDF journals and I need to get the text out of
>>>>>> > them. They contain a lot of romanized Sanskrit diacritical marks and
>>>>>> > that creates a difficulty. I tried Finereader and OmniPage but they
>>>>>> > cannot be trained to recognize those symbols. I just need an OCR
>>>>>> > program I can train to show any symbol required and the above
>>>>>> > programs cannot do that.
>>>>>> >
>>>>>> > Where should I start from? I feel like this program can do the job
>>>>>> > but can you help me to get started?
>>>>>> > I downloaded tesseract and installed it (windows). There are
>>>>>> > different GUIs available and I think it will make it easier to work.
>>>>>> > Can you suggest a good one? I tried gimagereader but it's too
>>>>>> > primitive and leaves a lot of work to be done afterwards with the
>>>>>> > overall text.
>>>>>> >
>>>>>> > I don't think this kind of language pack is available. How do I
>>>>>> > create one?
>>>>>> >
>>>>>> > I will add one pdf and fonts that were used to create it. Maybe
>>>>>> > someone would like to try and let me know how to do it?
>>>>>> >
>>>>>> > Thank you for any help!
>>>>>> >
>>>>>> > Regards,
>>>>>> > Srivas
>>>>>>
>>>>>>
>>>>> --
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To post to this group, send email to [email protected]
>>>>>
>>>>> To unsubscribe from this group, send email to
>>>>> [email protected]
>>>>>
>>>>> For more options, visit this group at
>>>>> http://groups.google.com/group/tesseract-ocr?hl=en
>>>>>
>>>>> ---
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to [email protected].
>>>>>
>>>>> For more options, visit https://groups.google.com/groups/opt_out.

