That would be great; however, I still cannot import a PDF into VietOCR. Of course, there are other GUIs that can do the work, but this one looks nice. I have already written to the author of the program about it. As soon as this is solved, I will post the solution here as well.
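In the meantime, for anyone who wants to avoid both sed and the DangAmbigs route, the substitutions from Shree Devi Kumar's sed script (quoted below) can probably be applied with a few lines of Python instead. This is only a sketch: the name fix_diacritics is something I made up, the mapping is copied from the ten rules in that message, and I am assuming that the Ω in the extracted text is U+03A9 and not the ohm sign U+2126.

```python
import unicodedata

# Legacy-font character -> Unicode transliteration character.
# Copied from the ten sed rules quoted in this thread; code points are
# written as escapes so every key is guaranteed to be one character.
LEGACY_TO_UNICODE = {
    "\u00c5": "\u0100",  # Å -> Ā
    "\u00e5": "\u0101",  # å -> ā
    "\u00ae": "\u1e5b",  # ® -> ṛ
    "\u00df": "\u1e63",  # ß -> ṣ
    "\u222b": "\u1e47",  # ∫ -> ṇ
    "\u00ee": "\u012b",  # î -> ī
    "\u00ca": "\u012a",  # Ê -> Ī
    "\u00b8": "\u015a",  # ¸ -> Ś
    "\u03a9": "\u015b",  # Ω -> ś (assumption: Greek omega, not U+2126)
    "\u00fc": "\u016b",  # ü -> ū
}

_TABLE = str.maketrans(LEGACY_TO_UNICODE)


def fix_diacritics(text: str) -> str:
    """Replace legacy-font characters with proper Unicode diacritics."""
    # Compose first, so "A" + combining ring is seen as the single char "Å".
    composed = unicodedata.normalize("NFC", text)
    return composed.translate(_TABLE)


if __name__ == "__main__":
    # Text copied out of the PDF still carries the legacy characters.
    print(fix_diacritics("K\u00ae\u00df\u222ba"))
```

Unlike a chain of sed rules, the translation table is applied in a single pass, so the output of one rule can never be picked up by another rule. Characters that are not in the table are left untouched.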
On Wednesday, November 27, 2013 10:59:43 PM UTC+7, V S Rawat wrote:
> That is a very convenient solution, Shree Devi ji.
>
> However, if sed or other "substitutors" are not available, or if one wants to avoid using them, I think it can be done using the built-in post-processing method of tesseract.
>
> Use san.DangAmbigs.txt or hin.DangAmbigs.txt, whichever language you are using.
>
> Then put them as
> Å=Ā
> one per line.
>
> Should it work equally well and automatically, without needing a manual step?
>
> If so, then, Shree Devi ji, is there any major benefit of post-processing in sed?
>
> Please remind me where this DangAmbigs file is to be put?
>
> Thanks.
> --
> Rawat
>
> On 11/27/2013 6:50 PM, Shree Devi Kumar wrote:
> > I think rather than try to OCR, please extract the text and then run a conversion script to change the letters with diacritical marks.
> >
> > E.g. you would do the following substitutions using sed for the sample text from page 11:
> >
> > s/Å/Ā/g
> > s/å/ā/g
> > s/®/ṛ/g
> > s/ß/ṣ/g
> > s/∫/ṇ/g
> > s/î/ī/g
> > s/Ê/Ī/g
> > s/¸/Ś/g
> > s/Ω/ś/g
> > s/ü/ū/g
> >
> > Also attaching the sed script as a UTF-8 text file.
> >
> > Shree Devi Kumar
> > ____________________________________________________________
> > भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
> >
> > On Wed, Nov 27, 2013 at 3:45 PM, V S Rawat <[email protected]> wrote:
> >
> > Those Ā á characters are defined in the Garamond font, but the ASCII codes used in this document are not the same as those defined in the Garamond font.
> >
> > So, it is some other font in which these ASCII codes have been defined for these characters.
> >
> > The document lists a dozen fonts; it might be one of them. You need to figure out which font it could be by trial and error.
> >
> > Thanks.
> > --
> > Rawat
> >
> > On 11/27/2013 3:17 PM, Jaanus Henno wrote:
> >
> > Ok, you can try page 11. There is a glossary and lots of words with diacritics. Thanks.
> >
> > On Wed, Nov 27, 2013 at 4:41 PM, V S Rawat <[email protected]> wrote:
> >
> > "words with sanskrit transliteration marks are used"
> >
> > Could you please point out the exact pages where to look for them? I will try to OCR it and see the results.
> >
> > Also,
> >
> > http://www.omkarananda-ashram.org/Sanskrit/itranslator99.htm#downloads
> >
> > The above page and several links from that page also have a lot of Sanskrit fonts. Maybe some of them might be of use to you.
> >
> > Thanks.
> > --
> > Rawat
> >
> > On 11/27/2013 9:16 AM, Srivas wrote:
> >
> > Hi Rawat!
> >
> > I'm really sorry, I didn't know that this is a mailing-list type of forum ;-(
> >
> > Second, if you look carefully, you will see that the text is not entirely English. In many places words with Sanskrit transliteration marks are used. But as you said, it can actually be copied and pasted, and that didn't even come to my mind! So this part is actually working, and that is great! So I am almost there. The remaining problem is of another type. The provided tamalten font will display the marks, but I need to use another font to display the final document. It also contains the same diacritical marks but uses another encoding. But this might be a question for another person; I know the author of the fonts, I will ask him. Thanks for the help!
> >
> > Btw. if anyone needs to use Sanskrit transliterated fonts, here are the resources:
> > http://www.krishna-das.com/ksyberspace/fonts/
> >
> > On Tuesday, November 26, 2013 4:47:11 PM UTC+7, V S Rawat wrote:
> >
> > Dear Sir Srivas ji,
> >
> > Firstly, you should not have sent a 2.2 MB, 68-page pdf file and a 181 KB zip to all the list members unasked. You could have uploaded them somewhere and sent the link, so that only those who can contribute download them. It is a waste of time and bandwidth to get such huge messages.
> >
> > Secondly, I couldn't really understand your issue. I saw your pdf file. It is pure English. You can open it in any pdf reader and just copy the entire text from there and paste it into a text or Word file. So, what else exactly are you looking for? Please elaborate.
> >
> > You don't even need to OCR it. This is already ASCII text.
> >
> > Thanks.
> > --
> > Rawat
> >
> > On 11/26/2013 12:40 PM, Srivas wrote:
> > > Hi!
> > > I have a bunch of PDF journal files and I need to get the text out of them. They contain a lot of romanized Sanskrit diacritical marks, and that creates a difficulty. I tried FineReader and OmniPage, but they cannot be trained to recognize those symbols. I just need an OCR program I can train to show any symbol required, and the above programs cannot do that.
> > >
> > > Where should I start from? I feel like this program can do the job, but can you help me to get started? I downloaded tesseract and installed it (Windows). There are different GUIs available, and I think one will make it easier to work. Can you suggest a good one? I tried gimagereader, but it's too primitive and leaves a lot of work to be done afterwards with the overall text.
> > >
> > > I don't think this kind of language pack is available, so how would I create it?
> > >
> > > I will attach one pdf and the fonts that were used to create it. Maybe someone would like to try and let me know how to do it?
> > >
> > > Thank you for any help!
> > >
> > > Regards,
> > > Srivas

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [email protected]
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en

