aaa.DangAmbigs.txt (where aaa is the language code, e.g. hin) is a user-defined file used by VietOCR for post-processing (post-OCR) corrections.
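To illustrate the idea, here is a minimal Python sketch of this kind of post-OCR correction. It assumes the simple "wrong=right, one pair per line" layout mentioned in the quoted thread below (Å=Ā) and reuses the character pairs from the sed substitutions quoted there. It is not VietOCR's actual parser, and the names `parse_pairs`, `apply_pairs`, `RULES` and `PAIRS` are made up for this example.

```python
# Sketch only: apply "wrong=right" replacement pairs, one pair per line.
# The file layout is an assumption based on the Å=Ā example in this thread;
# this is not VietOCR's real implementation.

def parse_pairs(text):
    """Parse lines of the form wrong=right into an ordered list of pairs."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        wrong, right = line.split("=", 1)
        if wrong:
            pairs.append((wrong, right))
    return pairs


def apply_pairs(text, pairs):
    """Replace every occurrence of each wrong string with its right string."""
    for wrong, right in pairs:
        text = text.replace(wrong, right)
    return text


# Pairs taken from the sed substitutions quoted in this thread.
RULES = """\
Å=Ā
å=ā
®=ṛ
ß=ṣ
∫=ṇ
î=ī
Ê=Ī
¸=Ś
Ω=ś
ü=ū
"""

PAIRS = parse_pairs(RULES)

print(apply_pairs("K®ß∫a", PAIRS))  # Kṛṣṇa
```

The pairs are applied in file order with plain string replacement, so a rule's output should not be another rule's input (none of the pairs above collide).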
On Thursday, January 9, 2014 12:57:17 PM UTC-6, Ravi Roshan wrote:
>
> Please tell me where I could find this "hin.DangAmbigs.txt" file.
> Thank you.
>
> On Wednesday, 27 November 2013 21:29:43 UTC+5:30, V S Rawat wrote:
>>
>> That is a very convenient solution, Shree Devi ji.
>>
>> However, if sed or other "substitutors" are not there, or if one wants
>> to avoid using them, I think it can be done using the built-in
>> post-processing method of tesseract.
>>
>> Use san.DangAmbigs.txt or hin.DangAmbigs.txt, whichever language you are
>> using.
>>
>> Then put them as
>> Å=Ā
>> one per line.
>>
>> Should it work equally well and automatically, without needing a manual
>> step?
>>
>> If so, then, Shree Devi ji, is there any major benefit of
>> post-processing in sed?
>>
>> Please remind me where this DangAmbigs file is to be put?
>>
>> Thanks.
>> --
>> Rawat
>>
>> On 11/27/2013 6:50 PM, Shree Devi Kumar wrote:
>>> I think rather than try to OCR, please extract the text and then run a
>>> conversion script to change the letters with diacritical marks.
>>>
>>> E.g. you would do the following substitutions using sed for the sample
>>> text from page 11:
>>>
>>> s/Å/Ā/g
>>> s/å/ā/g
>>> s/®/ṛ/g
>>> s/ß/ṣ/g
>>> s/∫/ṇ/g
>>> s/î/ī/g
>>> s/Ê/Ī/g
>>> s/¸/Ś/g
>>> s/Ω/ś/g
>>> s/ü/ū/g
>>>
>>> Also attaching the sed script as a utf-8 text file.
>>>
>>> Shree Devi Kumar
>>> ____________________________________________________________
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>
>>> On Wed, Nov 27, 2013 at 3:45 PM, V S Rawat <[email protected]> wrote:
>>>>
>>>> Those Ā á characters are defined in the Garamond font, but the ASCII
>>>> codes used in this document are not the same as those defined in the
>>>> Garamond font.
>>>>
>>>> So it is some other font in which these ASCII codes have been defined
>>>> for these characters.
>>>>
>>>> The document lists a dozen fonts; one of them might be it. You need
>>>> to figure out which font it could be by trial and error.
>>>>
>>>> Thanks.
>>>> --
>>>> Rawat
>>>>
>>>> On 11/27/2013 3:17 PM, Jaanus Henno wrote:
>>>>>
>>>>> Ok, you can try page 11. There is a glossary and lots of words with
>>>>> diacritics. Thanks.
>>>>>
>>>>> On Wed, Nov 27, 2013 at 4:41 PM, V S Rawat <[email protected]> wrote:
>>>>>>
>>>>>> "words with sanskrit transliteration marks are used"
>>>>>>
>>>>>> Could you please point out the exact pages where to look for it?
>>>>>> I will try to OCR it and see the results.
>>>>>>
>>>>>> Also,
>>>>>>
>>>>>> http://www.omkarananda-ashram.org/Sanskrit/itranslator99.htm#downloads
>>>>>>
>>>>>> The above page and several links from that page also have a lot of
>>>>>> Sanskrit fonts. Maybe some of them might be of use to you.
>>>>>>
>>>>>> Thanks.
>>>>>> --
>>>>>> Rawat
>>>>>>
>>>>>> On 11/27/2013 9:16 AM, Srivas wrote:
>>>>>>>
>>>>>>> Hi Rawat!
>>>>>>>
>>>>>>> I'm really sorry, I didn't know that this is a mailing-list type
>>>>>>> of forum ;-(
>>>>>>>
>>>>>>> Second, if you look carefully, you will see that the text is not
>>>>>>> entirely English. In many places words with Sanskrit
>>>>>>> transliteration marks are used. But as you said, it can actually
>>>>>>> be copy/pasted, and that didn't even come to my mind! So this part
>>>>>>> is actually working and that is great! So I am almost there. The
>>>>>>> remaining problem is of another type. The provided tamalten font
>>>>>>> will display the marks, but I need to use another font to display
>>>>>>> the final document. It also contains the same diacritical marks
>>>>>>> but uses another encoding. But this might be a question for
>>>>>>> another person; I know the author of the fonts, I will ask him.
>>>>>>> Thanks for the help!
>>>>>>>
>>>>>>> Btw, if anyone needs to use Sanskrit transliterated fonts, here
>>>>>>> are the resources:
>>>>>>> http://www.krishna-das.com/ksyberspace/fonts/
>>>>>>>
>>>>>>> On Tuesday, November 26, 2013 4:47:11 PM UTC+7, V S Rawat wrote:
>>>>>>>>
>>>>>>>> Dear Sir Srivas ji,
>>>>>>>>
>>>>>>>> Firstly, you should not have sent a 2.2 MB, 68-page pdf file and
>>>>>>>> a 181 KB zip to all the list members unasked. You could have
>>>>>>>> uploaded it somewhere and sent the link so that only those who
>>>>>>>> can contribute download it. It is a waste of time and bandwidth
>>>>>>>> to get such huge messages.
>>>>>>>>
>>>>>>>> Secondly, I couldn't really understand your issue. I saw your pdf
>>>>>>>> file. It is pure English. You can open it in any pdf reader and
>>>>>>>> just copy the entire text from there and paste it into a text or
>>>>>>>> Word file. So, what else exactly are you looking for? Please
>>>>>>>> elaborate.
>>>>>>>>
>>>>>>>> You don't even need to OCR it. These are already ASCII text.
>>>>>>>>
>>>>>>>> Thanks.
>>>>>>>> --
>>>>>>>> Rawat
>>>>>>>>
>>>>>>>> On 11/26/2013 12:40 PM, Srivas wrote:
>>>>>>>>> Hi!
>>>>>>>>> I have a bunch of PDF files (journals) and I need to get the
>>>>>>>>> text out of them. They contain a lot of romanized Sanskrit
>>>>>>>>> diacritical marks and that creates a difficulty. I tried
>>>>>>>>> FineReader and OmniPage but they cannot be trained to recognize
>>>>>>>>> those symbols. I just need an OCR program I can train to show
>>>>>>>>> any symbol required, and the above programs cannot do that.
>>>>>>>>>
>>>>>>>>> Where should I start from? I feel like this program can do the
>>>>>>>>> job, but can you help me to get started? I downloaded tesseract
>>>>>>>>> and installed it (Windows). There are different GUIs available
>>>>>>>>> and I think one will make it easier to work. Can you suggest a
>>>>>>>>> good one? I tried gimagereader but it's too primitive and leaves
>>>>>>>>> a lot of work to be done afterwards with the overall text.
>>>>>>>>>
>>>>>>>>> I don't think this kind of language pack is available, and how
>>>>>>>>> does one create it?
>>>>>>>>>
>>>>>>>>> I will add one pdf and the fonts that were used to create it.
>>>>>>>>> Maybe someone would like to try and let me know how to do it?
>>>>>>>>>
>>>>>>>>> Thank you for any help!
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Srivas

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [email protected]
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en

