Thank you both for your help. This letter replacement is a good idea! Looks
like this sed script will do the work. I will just have to see how to use
sed... Tomorrow I will check it out.
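
For reference, the substitutions in the sed script quoted below could also
be applied with a short Python script, which may be convenient on Windows
where sed is not installed by default. This is only a sketch: the mapping is
copied from the quoted sed lines, the names SUBSTITUTIONS, convert and
convert_file are made up for illustration, and it assumes the extracted text
is stored as UTF-8.

```python
# Legacy-font character -> Unicode character, copied from the quoted sed script.
SUBSTITUTIONS = {
    "Å": "Ā",
    "å": "ā",
    "®": "ṛ",
    "ß": "ṣ",
    "∫": "ṇ",
    "î": "ī",
    "Ê": "Ī",
    "¸": "Ś",
    "Ω": "ś",
    "ü": "ū",
}


def convert(text: str) -> str:
    """Apply every substitution globally, one after another, like sed's s/old/new/g."""
    for old, new in SUBSTITUTIONS.items():
        text = text.replace(old, new)
    return text


def convert_file(src_path: str, dst_path: str) -> None:
    """Read src_path as UTF-8, convert its text, and write the result to dst_path.

    Both paths are placeholders supplied by the caller.
    """
    with open(src_path, encoding="utf-8") as src:
        converted = convert(src.read())
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(converted)
```

With sed itself, a script file is normally applied as
sed -f script.sed input.txt > output.txt, where all three file names are
placeholders.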


On Wed, Nov 27, 2013 at 8:20 PM, Shree Devi Kumar <[email protected]> wrote:

> Rather than trying OCR, I suggest extracting the text and then running a
> conversion script to replace the letters that carry diacritical marks.
>
> E.g., for the sample text on page 11 you would make the following
> substitutions with sed:
>
> s/Å/Ā/g
> s/å/ā/g
> s/®/ṛ/g
> s/ß/ṣ/g
> s/∫/ṇ/g
> s/î/ī/g
> s/Ê/Ī/g
> s/¸/Ś/g
> s/Ω/ś/g
> s/ü/ū/g
>
> I am also attaching the sed script as a UTF-8 text file.
>
> Shree Devi Kumar
> ____________________________________________________________
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
>
> On Wed, Nov 27, 2013 at 3:45 PM, V S Rawat <[email protected]> wrote:
>
>> Those Ā and á characters are defined in the Garamond font, but the
>> character codes used in this document are not the ones Garamond assigns
>> to them.
>>
>> So the document must use some other font in which these codes are mapped
>> to these characters.
>>
>> The document lists a dozen fonts; one of them might be it. You need to
>> figure out which font it is by trial and error.
>>
>> Thanks.
>> --
>> Rawat
>>
>>
>> On 11/27/2013 3:17 PM, Jaanus Henno wrote:
>>
>>> OK, you can try page 11. It has a glossary and lots of words with
>>> diacritics. Thanks.
>>>
>>>
>>> On Wed, Nov 27, 2013 at 4:41 PM, V S Rawat <[email protected]> wrote:
>>>
>>>
>>>     "words with sanskrit transliteration marks are used"
>>>
>>>     Could you please point out the exact pages to look at? I will
>>>     try to OCR them and see the results.
>>>
>>>     Also,
>>>     http://www.omkarananda-ashram.org/Sanskrit/itranslator99.htm#downloads
>>>
>>>     The above page, and several pages linked from it, offer a lot of
>>>     Sanskrit fonts. Maybe one of them will be useful to you.
>>>
>>>     Thanks.
>>>     --
>>>     Rawat
>>>
>>>
>>>     On 11/27/2013 9:16 AM, Srivas wrote:
>>>
>>>         Hi Rawat!
>>>
>>>         I'm really sorry, I didn't know that this is a mailing-list type
>>>         of forum ;-(
>>>
>>>         Second, if you look carefully, you will see that the text is not
>>>         entirely English. In many places words with Sanskrit
>>>         transliteration marks are used. But as you said, it can actually
>>>         be copied and pasted, and that didn't even come to my mind! So
>>>         this part is actually working, and that is great! So I am almost
>>>         there. The remaining problem is of another type. The provided
>>>         tamalten font will display the marks, but I need to use another
>>>         font to display the final document. It also contains the same
>>>         diacritical marks but uses another encoding. But this might be a
>>>         question for another person; I know the author of the fonts, so
>>>         I will ask him. Thanks for the help!
>>>
>>>         Btw, if anyone needs to use Sanskrit transliteration fonts, here
>>>         are the resources: http://www.krishna-das.com/ksyberspace/fonts/
>>>
>>>         On Tuesday, November 26, 2013 4:47:11 PM UTC+7, V S Rawat wrote:
>>>
>>>              Dear Sir Srivas ji,
>>>
>>>              Firstly, you should not have sent a 2.2 MB, 68-page PDF file
>>>              and a 181 KB zip to all the list members unasked. You could
>>>              have uploaded them somewhere and sent the link, so that only
>>>              those who can contribute would download them. Such huge
>>>              messages are a waste of time and bandwidth.
>>>
>>>              Secondly, I couldn't really understand your issue. I looked
>>>              at your PDF file. It is pure English. You can open it in any
>>>              PDF reader, copy the entire text from there, and paste it
>>>              into a text or Word file. So what else exactly are you
>>>              looking for? Please elaborate.
>>>
>>>              You don't even need to OCR it. The text is already ASCII.
>>>
>>>              Thanks.
>>>              --
>>>              Rawat
>>>
>>>
>>>              On 11/26/2013 12:40 PM, Srivas wrote:
>>>               > Hi!
>>>               > I have a bunch of PDF journal files and I need to get the
>>>               > text out of them. They contain a lot of romanized Sanskrit
>>>               > diacritical marks, and that creates a difficulty. I tried
>>>               > FineReader and OmniPage, but they cannot be trained to
>>>               > recognize those symbols. I just need an OCR program that I
>>>               > can train to recognize any symbol required, and the above
>>>               > programs cannot do that.
>>>               >
>>>               > Where should I start? I feel like this program can do the
>>>               > job, but can you help me get started? I downloaded
>>>               > tesseract and installed it (Windows). There are different
>>>               > GUIs available, and I think one will make it easier to
>>>               > work. Can you suggest a good one? I tried gImageReader, but
>>>               > it's too primitive and leaves a lot of work to be done
>>>               > afterwards on the overall text.
>>>               >
>>>               > I don't think this kind of language pack is available, so
>>>               > how do I create one?
>>>               >
>>>               > I will attach one PDF and the fonts that were used to
>>>               > create it. Maybe someone would like to try and let me know
>>>               > how to do it?
>>>               >
>>>               > Thank you for any help!
>>>               >
>>>               > Regards,
>>>               > Srivas
>>>
>>>
>> --
>> --
>> You received this message because you are subscribed to the Google
>> Groups "tesseract-ocr" group.
>> To post to this group, send email to [email protected]
>> To unsubscribe from this group, send email to
>> [email protected]
>> For more options, visit this group at
>> http://groups.google.com/group/tesseract-ocr?hl=en
>>
>> --- You received this message because you are subscribed to the Google
>> Groups "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>>
>> For more options, visit https://groups.google.com/groups/opt_out.
>>
