That would be great; however, I still cannot import a PDF into VietOCR. Of 
course, there are other GUIs that can do the work, but this one looks nice. I 
have already written to the author of the program about it. As soon as this is 
solved, I will post the solution here as well. 

On Wednesday, November 27, 2013 10:59:43 PM UTC+7, V S Rawat wrote:
>
> That is very convenient solution, Shree Devi ji. 
>
> However, if sed or other "substitutors" are not there, or if one wants 
> to avoid using them, I think it can be done using built in 
> post-processing method of tesseract. 
>
> use san.DangAmbigs.txt or hin.DangAmbigs.txt whichever language you are 
> using. 
>
> then put them as 
> Å=Ā 
> one per line. 
>
> Should it work equally well and automatically, without needing manual 
> step? 
>
> if so, then, Shree Devi ji, is there any major benefit of post 
> processing in sed? 
>
> Please remind me where this DangAmbigs file is to be put? 
>
> Thanks. 
> -- 
> Rawat 
>
> On 11/27/2013 6:50 PM, Shree Devi Kumar wrote: 
> > I think rather than try to OCR, please extract the text and then run a 
> > conversion script to change the letters with diacritical marks. 
> > 
> > eg. you would do the following substitution using sed for the sample 
> > text from page 11 
> > 
> > s/Å/Ā/g 
> > s/å/ā/g 
> > s/®/ṛ/g 
> > s/ß/ṣ/g 
> > s/∫/ṇ/g 
> > s/î/ī/g 
> > s/Ê/Ī/g 
> > s/¸/Ś/g 
> > s/Ω/ś/g 
> > s/ü/ū/g 
> > 
> > Also attaching sed script as a utf-8 text file. 
> > 
> > Shree Devi Kumar 
> > ____________________________________________________________ 
> > भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com 
> > 
> > 
> > On Wed, Nov 27, 2013 at 3:45 PM, V S Rawat <[email protected]> wrote: 
> > 
> >     those Ā á character are defined in Garamond font, but the ASCII code 
> >     used in this document is not the same as defined in Garamond font. 
> > 
> >     So, it is some other font where these ASCII codes have been defined 
> >     for this character. 
> > 
> >     The document lists a dozen fonts; one of them might be it. You need 
> >     to figure out which font it could be by trial and error. 
> > 
> >     Thanks. 
> >     -- 
> >     Rawat 
> > 
> > 
> >     On 11/27/2013 3:17 PM, Jaanus Henno wrote: 
> > 
> >         Ok, you can try page 11. There is a glossary and lots of words 
> >         with diacritics. Thanks. 
> > 
> > 
> >         On Wed, Nov 27, 2013 at 4:41 PM, V S Rawat <[email protected]> wrote: 
> > 
> > 
> >              "words with sanskrit transliteration marks are used" 
> > 
> >              could you please point out exact pages where to look for it. 
> >              I will try to ocr it and see the results. 
> > 
> >              Also, 
> >              http://www.omkarananda-ashram.org/Sanskrit/itranslator99.htm#downloads 
> > 
> >              The above page and several links from that page also have a 
> >              lot of Sanskrit fonts. Maybe one of them will be useful to you. 
> > 
> >              Thanks. 
> >              -- 
> >              Rawat 
> > 
> > 
> >              On 11/27/2013 9:16 AM, Srivas wrote: 
> > 
> >                  Hi Rawat! 
> > 
> >                  I'm really sorry, I didn't know that this is a mailing 
> >                  list type of forum ;-( 
> > 
> >                  Second, if you look carefully, you will see that the text 
> >                  is not entirely english. In many places words with 
> >                  sanskrit transliteration marks are used. But as you said, 
> >                  it can actually be copy/pasted, and that didn't even come 
> >                  to my mind! So this part is actually working and that is 
> >                  great! So I am almost there. The remaining problem is of 
> >                  another type. The provided tamalten font will display the 
> >                  marks, but I need to use another font to display the 
> >                  final document. It also contains the same diacritical 
> >                  marks but uses another encoding. But this might be a 
> >                  question for another person. I know the author of the 
> >                  fonts; I will ask him. Thanks for the help! 
> > 
> >                  Btw. If anyone needs to use sanskrit transliterated fonts, 
> >                  here are the resources: 
> >                  http://www.krishna-das.com/ksyberspace/fonts/ 
> > 
> >                  On Tuesday, November 26, 2013 4:47:11 PM UTC+7, V S Rawat wrote: 
> > 
> >                       Dear Sir Srivas ji, 
> > 
> >                       Firstly, you should not have sent a 2.2 MB, 68-page 
> >                       pdf file and a 181 KB zip to all the list members 
> >                       unasked. You could have uploaded it somewhere and 
> >                       sent the link, so that only those who can contribute 
> >                       download it. It is a waste of time and bandwidth to 
> >                       get such huge messages. 
> > 
> >                       Secondly, I couldn't really understand your issue. I 
> >                       saw your pdf file. It is pure English. You can open 
> >                       it in any pdf reader, copy the entire text from 
> >                       there, and paste it into a text or Word file. So what 
> >                       else exactly are you looking for? Please elaborate. 
> > 
> >                       You don't even need to ocr it. This is already ASCII 
> >                       text. 
> > 
> >                       Thanks. 
> >                       -- 
> >                       Rawat 
> > 
> > 
> >                       On 11/26/2013 12:40 PM, Srivas wrote: 
> >                        > Hi! 
> >                        > I have a bunch of PDF journal files and I need to get the 
> >                        > text out of them. They contain a lot of romanized sanskrit 
> >                        > diacritical marks and that creates a difficulty. I tried 
> >                        > Finereader and OmniPage but they cannot be trained to 
> >                        > recognize those symbols. I just need an OCR program I can 
> >                        > train to show any symbol required, and the above programs 
> >                        > cannot do that. 
> >                        > 
> >                        > Where should I start from? I feel like this program can do 
> >                        > the job, but can you help me to get started? I downloaded 
> >                        > tesseract and installed it (windows). There are different 
> >                        > GUIs available and I think one will make it easier to 
> >                        > work. Can you suggest a good one? I tried gimagereader but 
> >                        > it's too primitive and leaves a lot of work to be done 
> >                        > afterwards with the overall text. 
> >                        > 
> >                        > I don't think this kind of language pack is available, so 
> >                        > how do I create it? 
> >                        > 
> >                        > I will add one pdf and the fonts that were used to create 
> >                        > it. Maybe someone would like to try and let me know how to 
> >                        > do it? 
> >                        > 
> >                        > Thank you for any help! 
> >                        > 
> >                        > Regards, 
> >                        > Srivas 
>
>
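
For anyone without sed (for example on Windows), the substitution list quoted above can be sketched in Python. This is only a minimal sketch: the character mapping is copied from the sed script in the thread, while the function name fix_diacritics and the NFC normalization step are my own assumptions and not part of the original script.

```python
import unicodedata

# Mapping copied from the sed script quoted in this thread:
# character found in the extracted text -> intended transliteration character.
SUBSTITUTIONS = {
    "Å": "Ā",
    "å": "ā",
    "®": "ṛ",
    "ß": "ṣ",
    "∫": "ṇ",
    "î": "ī",
    "Ê": "Ī",
    "¸": "Ś",
    "Ω": "ś",
    "ü": "ū",
}

# Normalize the keys to NFC so each one is a single code point,
# which is what str.maketrans requires for dictionary keys.
TABLE = str.maketrans(
    {unicodedata.normalize("NFC", src): dst for src, dst in SUBSTITUTIONS.items()}
)


def fix_diacritics(text: str) -> str:
    """Apply all substitutions in a single pass (hypothetical helper).

    The input is normalized to NFC first (an assumption), so that
    decomposed forms such as 'a' + combining ring are also matched.
    """
    return unicodedata.normalize("NFC", text).translate(TABLE)
```

Since none of the replacement characters also appears as a source character, doing all substitutions in one pass should give the same result as running the sed commands one after another. To convert a file, read it as UTF-8, pass the contents to fix_diacritics, and write the result back out as UTF-8.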

-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
