sed -f roman.sed inputfile.txt > outputfile.txt

You will have to add other substitutions to the file roman.sed - it only
has the first few substitutions that I encountered.
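
As a rough, self-contained sketch of how this fits together, you could create the script file and try it on a single word. It assumes a POSIX shell and a sed that passes UTF-8 text through unchanged. The file sample.txt and its contents are made up for illustration, and the rules are only the ten quoted further down in this thread.

```shell
#!/bin/sh
# Sketch only: write the ten substitutions quoted in this thread to roman.sed.
# Each rule maps a wrongly encoded character to the intended diacritic letter.
cat > roman.sed <<'EOF'
s/Å/Ā/g
s/å/ā/g
s/®/ṛ/g
s/ß/ṣ/g
s/∫/ṇ/g
s/î/ī/g
s/Ê/Ī/g
s/¸/Ś/g
s/Ω/ś/g
s/ü/ū/g
EOF

# Hypothetical test input: one word as it might come out of the PDF.
printf '%s\n' 'K®ß∫a' > sample.txt

# Apply every rule in roman.sed to the sample and print the result.
sed -f roman.sed sample.txt
```

With these rules the sample word should come out as Kṛṣṇa. sed applies the rules in order to each line, so when you add a new rule, check that it does not match the output of an earlier one.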

Shree Devi Kumar
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


On Wed, Nov 27, 2013 at 7:08 PM, Jaanus Henno <[email protected]> wrote:

> Thank you both for your help. This letter replacement is a good idea!
> Looks like this sed script will do the work. I will just have to see how to
> use sed... Tomorrow I will check it out.
>
>
> On Wed, Nov 27, 2013 at 8:20 PM, Shree Devi Kumar <[email protected]> wrote:
>
>> Rather than trying to OCR the pages, I think you should extract the text
>> from the PDF and then run a conversion script that maps the wrongly
>> encoded characters to the correct letters with diacritical marks.
>>
>> E.g., for the sample text from page 11 you would do the following
>> substitutions using sed:
>>
>> s/Å/Ā/g
>> s/å/ā/g
>> s/®/ṛ/g
>> s/ß/ṣ/g
>> s/∫/ṇ/g
>> s/î/ī/g
>> s/Ê/Ī/g
>> s/¸/Ś/g
>> s/Ω/ś/g
>> s/ü/ū/g
>>
>> I am also attaching the sed script as a UTF-8 text file.
>>
>> Shree Devi Kumar
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>>
>> On Wed, Nov 27, 2013 at 3:45 PM, V S Rawat <[email protected]> wrote:
>>
>>> Those Ā á characters are defined in the Garamond font, but the character
>>> codes used in this document are not the ones Garamond assigns to them.
>>>
>>> So it must be some other font in which these codes are mapped to these
>>> characters.
>>>
>>> The document lists a dozen fonts, and one of them might be it. You need
>>> to figure out which font it is by trial and error.
>>>
>>> Thanks.
>>> --
>>> Rawat
>>>
>>>
>>> On 11/27/2013 3:17 PM, Jaanus Henno wrote:
>>>
>>>> OK, you can try page 11. There is a glossary there with lots of words
>>>> with diacritics. Thanks.
>>>>
>>>>
>>>> On Wed, Nov 27, 2013 at 4:41 PM, V S Rawat <[email protected]> wrote:
>>>>
>>>>
>>>>     "words with sanskrit transliteration marks are used"
>>>>
>>>>     Could you please point out the exact pages where I should look? I will
>>>>     try to OCR them and see the results.
>>>>
>>>>     Also,
>>>>     http://www.omkarananda-ashram.org/Sanskrit/itranslator99.htm#downloads
>>>>
>>>>     The above page and several pages linked from it offer a lot of
>>>>     Sanskrit fonts. Maybe one of them will be of use to you.
>>>>
>>>>     Thanks.
>>>>     --
>>>>     Rawat
>>>>
>>>>
>>>>     On 11/27/2013 9:16 AM, Srivas wrote:
>>>>
>>>>         Hi Rawat!
>>>>
>>>>         I'm really sorry, I didn't know that this is a mailing-list type
>>>>         of forum ;-(
>>>>
>>>>         Second, if you look carefully, you will see that the text is not
>>>>         entirely English. In many places, words with Sanskrit
>>>>         transliteration marks are used. But as you said, it can actually
>>>>         be copied and pasted, and that didn't even come to my mind! So this
>>>>         part is actually working, and that is great! So I am almost there.
>>>>         The remaining problem is of another type. The provided tamalten
>>>>         font will display the marks, but I need to use another font to
>>>>         display the final document. It also contains the same diacritical
>>>>         marks but uses another encoding. But this might be a question for
>>>>         another person; I know the author of the fonts, and I will ask
>>>>         him. Thanks for the help!
>>>>
>>>>         Btw., if anyone needs to use Sanskrit transliteration fonts, here
>>>>         are the resources: http://www.krishna-das.com/ksyberspace/fonts/
>>>>
>>>>         On Tuesday, November 26, 2013 4:47:11 PM UTC+7, V S Rawat wrote:
>>>>
>>>>              Dear Sir Srivas ji,
>>>>
>>>>              Firstly, you should not have sent a 2.2 MB, 68-page PDF file
>>>>              and a 181 KB zip to all the list members unasked. You could
>>>>              have uploaded them somewhere and sent the link, so that only
>>>>              those who can contribute would download them. It is a waste
>>>>              of time and bandwidth to get such huge messages.
>>>>
>>>>              Secondly, I couldn't really understand your issue. I saw your
>>>>              PDF file. It is pure English. You can open it in any PDF
>>>>              reader, copy the entire text from there, and paste it into a
>>>>              text or Word file. So what else exactly are you looking for?
>>>>              Please elaborate.
>>>>
>>>>              You don't even need to OCR it. It is already ASCII text.
>>>>
>>>>              Thanks.
>>>>              --
>>>>              Rawat
>>>>
>>>>
>>>>              On 11/26/2013 12:40 PM, Srivas wrote:
>>>>               > Hi!
>>>>               > I have a bunch of journals as PDF files and I need to get
>>>>               > the text out of them. They contain a lot of romanized
>>>>               > Sanskrit diacritical marks, and that creates a difficulty.
>>>>               > I tried FineReader and OmniPage, but they cannot be trained
>>>>               > to recognize those symbols. I just need an OCR program that
>>>>               > I can train to output any symbol required, and the above
>>>>               > programs cannot do that.
>>>>               >
>>>>               > Where should I start? I feel like this program can do the
>>>>               > job, but can you help me get started? I downloaded
>>>>               > tesseract and installed it (Windows). There are different
>>>>               > GUIs available, and I think one will make it easier to
>>>>               > work. Can you suggest a good one? I tried gImageReader, but
>>>>               > it's too primitive and leaves a lot of work to be done
>>>>               > afterwards on the overall text.
>>>>               >
>>>>               > I don't think this kind of language pack is available, so
>>>>               > how do I create one?
>>>>               >
>>>>               > I will add one PDF and the fonts that were used to create
>>>>               > it. Maybe someone would like to try and let me know how to
>>>>               > do it?
>>>>               >
>>>>               > Thank you for any help!
>>>>               >
>>>>               > Regards,
>>>>               > Srivas
>>>>
>>>>
>>> --
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To post to this group, send email to [email protected]
>>> To unsubscribe from this group, send email to
>>> [email protected]
>>> For more options, visit this group at
>>> http://groups.google.com/group/tesseract-ocr?hl=en
>>>
>>> --- You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>>
>>> For more options, visit https://groups.google.com/groups/opt_out.
>>>
>>
>>
>
>
