How do you run sed on Vim?
On Thu, Nov 28, 2013 at 12:53 AM, V S Rawat wrote:
> Yes, for Srivas ji's file text is 100% text, not images, and is 100%
> extractable to word/text file by simple copy paste. ocr is just not needed.
>
> Then, it is good that sed will make the changes without need of ocr. Good
> thought.
>
> I use vim on w8 so, I wouldn't downgrade to sed. he he. just kidding. vim
> has sed built in. :-)
>
> Thanks.
> --
> Rawat
>
> On 11/27/2013 9:50 PM, Shree Devi Kumar wrote:
>
>> Rawatji,
>>
>> I was going by the assumption that the text can be easily extracted from
>> his pdf by saving as txt. In that case just running the sed script will
>> fix the text for the letters with diacritics which were mapped to some
>> other letters in the ascii font.
>>
>> Doing OCR never gives 100% correct result, so to use the OCR output and
>> postprocess in this case may not be the best solution.
>>
>> You could try windows version of sed from
>> http://gnuwin32.sourceforge.net/packages/sed.htm
>>
>> i only tested using one para of text from page 11.
>>
>> Shree
>>
>> Shree Devi Kumar
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Wed, Nov 27, 2013 at 9:29 PM, V S Rawat wrote:
>>
>> That is very convenient solution, Shree Devi ji.
>>
>> However, if sed or other "substitutors" are not there, or if one
>> wants to avoid using them, I think it can be done using built in
>> post-processing method of tesseract.
>>
>> use san.DangAmbigs.txt or hin.DangAmbigs.txt whichever language you
>> are using.
>>
>> then put them as
>> Å=Ā
>> one per line.
>>
>> Should it work equally well and automatically, without needing
>> manual step?
>>
>> if so, then, Shree Devi ji, is there any major benefit of post
>> processing in sed?
>>
>> Please remind me where this DangAmbigs file is to be put?
>>
>> Thanks.
>> --
>> Rawat
>>
>> On 11/27/2013 6:50 PM, Shree Devi Kumar wrote:
>>
>> I think rather than try to OCR, please extract the text and then run a
>> conversion script to change the letters with diacritical marks.
>>
>> eg. you would do the following substitution using sed for the sample
>> text from page 11
>>
>> s/Å/Ā/g
>> s/å/ā/g
>> s/®/ṛ/g
>> s/ß/ṣ/g
>> s/∫/ṇ/g
>> s/î/ī/g
>> s/Ê/Ī/g
>> s/¸/Ś/g
>> s/Ω/ś/g
>> s/ü/ū/g
>>
>> Also attaching sed script as a utf-8 text file.
>>
>> Shree Devi Kumar
>> ______________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Wed, Nov 27, 2013 at 3:45 PM, V S Rawat wrote:
>>
>> those Ā á character are defined in Garamond font, but the ASCII code
>> used in this document is not the same as defined in Garamond font.
>>
>> So, it is some other font where these ASCII codes have been defined
>> for this character.
>>
>> The document list a dozen fonts, some of it might be that. you need
>> to figure out which font it could be, by hammer hit trial error method.
>>
>> Thanks.
>> --
>> Rawat
>>
>> On 11/27/2013 3:17 PM, Jaanus Henno wrote:
>>
>> Ok, you can try page 11. There is glossary and lots of words with
>> diacritics. Thanks.
>>
>> On Wed, Nov 27, 2013 at 4:41 PM, V S Rawat wrote:
>>
>> "words with sanskrit transliteration marks are used"
>>
>> could you please point out exact pages where to look for it. I will
>> try to ocr it and see the results.
>>
>> Also,
>> http://www.omkarananda-ashram.org/Sanskrit/itranslator99.htm#downloads
>>
>> The above page and several links from that page also have a lot of
>> Sanskrit fonts. Maybe some of them could be of use to you.
>>
>> Thanks.
>> --
>> Rawat
>>
>> On 11/27/2013 9:16 AM, Srivas wrote:
>>
>> Hi Rawat!
>>
>> I'm really sorry, I didn't know that this is a mailing list type of
>> forum ;-(
>>
>> Second, if you look carefully, you will see that the text is not
>> entirely english. In many places words with sanskrit transliteration
>> marks are used. But as you said, it can actually be copy/pasted and it
>> didn't even come to my mind! So this part is actually working and that
>> is great! So I am almost there. The remaining problem is another type.
>> The provided tamalten font will display the marks, but I need to use
>> another font to display the final document. It also contains the same
>> diacritical marks but uses another encoding. But this might be a
>> question to another person, I know the author of the fonts, I will ask
>> him. Thanks for the help!
>>
>> Btw.
>> If anyone needs to use sanskrit transliterated fonts, here are the
>> resources:
>> http://www.krishna-das.com/ksyberspace/fonts/
>>
>> On Tuesday, November 26, 2013 4:47:11 PM UTC+7, V S Rawat wrote:
>>
>> Dear Sir Srivas ji,
>>
>> firstly, you should not have sent 2.2 MB 68 page pdf file and 181 KB
>> zip to all the list members unasked. You could have loaded it
>> somewhere and sent the link so that only those download it who can
>> contribute in it. It is a wastage of time and bandwidth to get such
>> huge messages.
>>
>> Secondly, I couldn't really understand your issue. I saw your pdf file.
>> it is pure English. You can open it in any pdf reader and just copy
>> entire text from there and paste in a text or word file. So, what else
>> exactly you are looking for, please elaborate.
>>
>> you don't even need to ocr it. These are already ASCII text.
>>
>> Thanks.
>> --
>> Rawat
>>
>> On 11/26/2013 12:40 PM, Srivas wrote:
>> > Hi!
>> > I have a bunch of PDF files journals and I need to get the text out of
>> > it. They contain a lot of romanized sanskrit diacritical marks and that
>> > creates a difficulty. I tried Finereader and OmniPage but they cannot be
>> > trained to recognize those symbols. I just need an OCR program I can
>> > train to show any symbol required and the above programs cannot do that.
>> >
>> > Where should I start from? I feel like this program can do the job but
>> > can you help me to get started?
>> > I downloaded tesseract and installed it
>> > (windows). There are different GUIs available and I think it will make
>> > it easier to work. Can you suggest a good one? I tried gimagereader but
>> > it's too primitive and leaves a lot of work to be done afterwards with
>> > the overall text.
>> >
>> > I don't think this kind of language pack is available and how to
>> > create it?
>> >
>> > I will add one pdf and fonts that were used to create it. Maybe someone
>> > would like to try and let me know how to do it?
>> >
>> > Thank you for any help!
>> >
>> > Regards,
>> > Srivas
>
> --
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to [email protected]
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en
>
> ---
> You received this message because you are subscribed to a topic in the
> Google Groups "tesseract-ocr" group.
> To unsubscribe from this topic, visit
> https://groups.google.com/d/topic/tesseract-ocr/6uG7HUxLY7w/unsubscribe.
> To unsubscribe from this group and all its topics, send an email to
> [email protected].
> For more options, visit https://groups.google.com/groups/opt_out.
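
For anyone without sed at hand, the substitution list Shree Devi Kumar posted in this thread can be sketched in Python. This is only an illustration, not the attached sed script: the mapping is copied from the ten `s/.../.../g` lines quoted above, while the function name `fix_diacritics` and the sample string are made up for the example. The pairs are applied one after another, the same way sed runs its commands in order.

```python
# Pairs copied from the sed substitutions quoted in the thread:
# left = character as it appears in the extracted text,
# right = intended transliteration character.
SUBSTITUTIONS = {
    "Å": "Ā",
    "å": "ā",
    "®": "ṛ",
    "ß": "ṣ",
    "∫": "ṇ",
    "î": "ī",
    "Ê": "Ī",
    "¸": "Ś",
    "Ω": "ś",
    "ü": "ū",
}


def fix_diacritics(text: str) -> str:
    """Apply every substitution in order, like consecutive sed s///g commands."""
    for wrong, right in SUBSTITUTIONS.items():
        text = text.replace(wrong, right)
    return text


if __name__ == "__main__":
    # Hypothetical sample string, not taken from the PDF in question.
    print(fix_diacritics("K®ß∫a"))
```

The same pairs can also be run from inside Vim without an external sed, one command per pair, for example `:%s/Å/Ā/g`. The `%` range is needed so that the whole buffer is covered rather than only the current line.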

