aaa.DangAmbigs.txt is a user-defined file that VietOCR uses for 
post-processing (post-OCR) corrections.
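
If neither sed nor VietOCR is at hand, the same kind of correction can be 
sketched in a few lines of Python. This is only an illustration, not how 
VietOCR itself reads the file: the function names are made up, the 
"wrong=right, one pair per line" layout is taken from the quoted message 
below, and skipping blank lines and lines starting with "#" is my own 
assumption. The pairs are copied from the sed script quoted below.

```python
# Pairs copied from the sed script quoted in this thread
# (glyph found in the extracted text -> intended letter).
SED_PAIRS = {
    "Å": "Ā", "å": "ā", "®": "ṛ", "ß": "ṣ", "∫": "ṇ",
    "î": "ī", "Ê": "Ī", "¸": "Ś", "Ω": "ś", "ü": "ū",
}


def parse_pairs(lines):
    """Parse 'wrong=right' lines into a dict (hypothetical file layout)."""
    pairs = {}
    for line in lines:
        line = line.strip()
        # Assumption: blank lines and '#' comment lines are ignored.
        if not line or line.startswith("#"):
            continue
        wrong, sep, right = line.partition("=")
        if sep and wrong:
            pairs[wrong] = right
    return pairs


def apply_pairs(text, pairs):
    """Replace every occurrence of each 'wrong' string with its 'right' one."""
    # Longest keys first, so a longer pattern is not pre-empted by a shorter one.
    for wrong in sorted(pairs, key=len, reverse=True):
        text = text.replace(wrong, pairs[wrong])
    return text


print(apply_pairs("K®ß∫a", SED_PAIRS))  # Kṛṣṇa
```

To read the pairs from a file instead, something like 
parse_pairs(open("hin.DangAmbigs.txt", encoding="utf-8")) should do, 
provided the file really uses that one-pair-per-line layout.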

On Thursday, January 9, 2014 12:57:17 PM UTC-6, Ravi Roshan wrote:
>
> Please tell me where I could find this " hin.DangAmbigs.txt" file.
> Thank you.
>
>
> On Wednesday, 27 November 2013 21:29:43 UTC+5:30, V S Rawat wrote:
>>
>> That is a very convenient solution, Shree Devi ji. 
>>
>> However, if sed or other "substitutors" are not available, or if one 
>> wants to avoid using them, I think it can be done using the built-in 
>> post-processing method of tesseract. 
>>
>> Use san.DangAmbigs.txt or hin.DangAmbigs.txt, whichever language you 
>> are using. 
>>
>> Then list the substitutions as 
>> Å=Ā 
>> one per line. 
>>
>> Would that work equally well and automatically, without needing a 
>> manual step? 
>>
>> If so, Shree Devi ji, is there any major benefit to post-processing 
>> with sed? 
>>
>> Please remind me where this DangAmbigs file should be put. 
>>
>> Thanks. 
>> -- 
>> Rawat 
>>
>> On 11/27/2013 6:50 PM, Shree Devi Kumar wrote: 
>> > Rather than trying to OCR the file, I think you should extract the 
>> > text and then run a conversion script to fix the letters with 
>> > diacritical marks. 
>> > 
>> > E.g., you would apply the following substitutions with sed to the 
>> > sample text from page 11: 
>> > 
>> > s/Å/Ā/g 
>> > s/å/ā/g 
>> > s/®/ṛ/g 
>> > s/ß/ṣ/g 
>> > s/∫/ṇ/g 
>> > s/î/ī/g 
>> > s/Ê/Ī/g 
>> > s/¸/Ś/g 
>> > s/Ω/ś/g 
>> > s/ü/ū/g 
>> > 
>> > I am also attaching the sed script as a UTF-8 text file. 
>> > 
>> > Shree Devi Kumar 
>> > ____________________________________________________________ 
>> > भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com 
>> > 
>> > 
>> > On Wed, Nov 27, 2013 at 3:45 PM, V S Rawat <[email protected]> wrote: 
>> > 
>> >     Those Ā á characters are defined in the Garamond font, but the 
>> >     ASCII codes used in this document are not the same as those 
>> >     defined in the Garamond font. 
>> > 
>> >     So it must be some other font in which these ASCII codes have 
>> >     been defined for these characters. 
>> > 
>> >     The document lists a dozen fonts, and it might be one of them. 
>> >     You need to figure out which font it could be by trial and error. 
>> > 
>> >     Thanks. 
>> >     -- 
>> >     Rawat 
>> > 
>> > 
>> >     On 11/27/2013 3:17 PM, Jaanus Henno wrote: 
>> > 
>> >         OK, you can try page 11. It has a glossary and lots of words 
>> >         with diacritics. Thanks. 
>> > 
>> > 
>> >         On Wed, Nov 27, 2013 at 4:41 PM, V S Rawat <[email protected]> wrote: 
>> > 
>> > 
>> >              "words with sanskrit transliteration marks are used" 
>> > 
>> >              Could you please point out the exact pages to look at? 
>> >              I will try to OCR them and see the results. 
>> > 
>> >              Also, 
>> >              http://www.omkarananda-ashram.org/Sanskrit/itranslator99.htm#downloads 
>> > 
>> >              The above page, and several pages linked from it, also 
>> >              have a lot of Sanskrit fonts. Maybe one of them will be 
>> >              useful to you. 
>> > 
>> >              Thanks. 
>> >              -- 
>> >              Rawat 
>> > 
>> > 
>> >              On 11/27/2013 9:16 AM, Srivas wrote: 
>> > 
>> >                  Hi Rawat! 
>> > 
>> >                  I'm really sorry, I didn't know that this is a mailing 
>> >                  list type of forum ;-( 
>> > 
>> >                  Second, if you look carefully, you will see that the 
>> >                  text is not entirely English. In many places words 
>> >                  with Sanskrit transliteration marks are used. But as 
>> >                  you said, it can actually be copied and pasted, which 
>> >                  didn't even come to my mind! So this part is actually 
>> >                  working, and that is great! So I am almost there. The 
>> >                  remaining problem is of another type. The provided 
>> >                  tamalten font will display the marks, but I need to 
>> >                  use another font to display the final document. It 
>> >                  also contains the same diacritical marks but uses 
>> >                  another encoding. But this might be a question for 
>> >                  another person. I know the author of the fonts, so I 
>> >                  will ask him. Thanks for the help! 
>> > 
>> >                  Btw, if anyone needs to use Sanskrit transliteration 
>> >                  fonts, here are the resources: 
>> >                  http://www.krishna-das.com/ksyberspace/fonts/ 
>> > 
>> >                  On Tuesday, November 26, 2013 4:47:11 PM UTC+7, V S Rawat wrote: 
>> > 
>> >                       Dear Sir Srivas ji, 
>> > 
>> >                       Firstly, you should not have sent a 2.2 MB, 
>> >                       68-page PDF file and a 181 KB zip to all the list 
>> >                       members unasked. You could have uploaded them 
>> >                       somewhere and sent the link, so that only those 
>> >                       who can contribute download them. Such huge 
>> >                       messages are a waste of time and bandwidth. 
>> > 
>> >                       Secondly, I couldn't really understand your 
>> >                       issue. I saw your PDF file. It is pure English. 
>> >                       You can open it in any PDF reader, copy the 
>> >                       entire text from there, and paste it into a text 
>> >                       or Word file. So please elaborate on what else 
>> >                       exactly you are looking for. 
>> > 
>> >                       You don't even need to OCR it. This is already 
>> >                       ASCII text. 
>> > 
>> >                       Thanks. 
>> >                       -- 
>> >                       Rawat 
>> > 
>> > 
>> >                       On 11/26/2013 12:40 PM, Srivas wrote: 
>> >                        > Hi! 
>> >                        > I have a bunch of PDF journal files and I need 
>> >                        > to get the text out of them. They contain a lot 
>> >                        > of romanized Sanskrit diacritical marks, and 
>> >                        > that creates a difficulty. I tried FineReader 
>> >                        > and OmniPage, but they cannot be trained to 
>> >                        > recognize those symbols. I just need an OCR 
>> >                        > program I can train to recognize any symbol 
>> >                        > required, and the above programs cannot do that. 
>> >                        > 
>> >                        > Where should I start? I feel like this program 
>> >                        > can do the job, but can you help me get 
>> >                        > started? I downloaded tesseract and installed 
>> >                        > it (Windows). There are different GUIs 
>> >                        > available, and I think one would make it easier 
>> >                        > to work. Can you suggest a good one? I tried 
>> >                        > gimagereader, but it's too primitive and leaves 
>> >                        > a lot of work to be done afterwards on the 
>> >                        > overall text. 
>> >                        > 
>> >                        > I don't think this kind of language pack is 
>> >                        > available, so how do I create one? 
>> >                        > 
>> >                        > I will attach one PDF and the fonts that were 
>> >                        > used to create it. Maybe someone would like to 
>> >                        > try it and let me know how to do it? 
>> >                        > 
>> >                        > Thank you for any help! 
>> >                        > 
>> >                        > Regards, 
>> >                        > Srivas 
>>
>>

-- 
-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
