OK, somehow I missed the last part of the conversation. I will try out the
Windows-based options you mentioned for sed.
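
For reference, here is a minimal sketch of the workflow quoted below, assuming a POSIX shell with sed available (on Windows, for example, through Cygwin). The script contains only the ten substitutions listed in this thread, so it is incomplete for a whole document. The sample word K®ß∫a is made up for illustration and is not taken from the PDF; the file names are the ones used in the quoted command.

```shell
#!/bin/sh
set -eu

# Write the substitution script. It holds only the mappings quoted in this
# thread; more lines will be needed for the remaining characters.
cat > roman.sed <<'EOF'
s/Å/Ā/g
s/å/ā/g
s/®/ṛ/g
s/ß/ṣ/g
s/∫/ṇ/g
s/î/ī/g
s/Ê/Ī/g
s/¸/Ś/g
s/Ω/ś/g
s/ü/ū/g
EOF

# Hypothetical one-word input, standing in for text copied out of the PDF.
printf '%s\n' 'K®ß∫a' > inputfile.txt

# Apply every substitution in roman.sed to the input file.
sed -f roman.sed inputfile.txt > outputfile.txt

cat outputfile.txt
```

Both roman.sed and the input file have to be saved as UTF-8 for the characters to match.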


On Thu, Nov 28, 2013 at 10:13 AM, Srivas <[email protected]> wrote:

> I'm a little new to all this. How do you run sed under Windows 7? I have
> read that it can also be run under Windows, but I cannot work out how to do
> that.
>
>
> On Wednesday, November 27, 2013 9:11:01 PM UTC+7, shree wrote:
>
>> sed -f roman.sed inputfile.txt > outputfile.txt
>>
>> You will have to add other substitutions to the file roman.sed - it only
>> has the first few substitutions that I encountered.
>>
>> Shree Devi Kumar
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>>
>> On Wed, Nov 27, 2013 at 7:08 PM, Jaanus Henno <[email protected]> wrote:
>>
>>>  Thank you both for your help. This letter replacement is a good idea!
>>> Looks like this sed script will do the work. I will just have to see how to
>>> use sed... Tomorrow I will check it out.
>>>
>>>
>>> On Wed, Nov 27, 2013 at 8:20 PM, Shree Devi Kumar <[email protected]> wrote:
>>>
>>>> I think that, rather than trying OCR, you should extract the text and
>>>> then run a conversion script to replace the letters that should carry
>>>> diacritical marks.
>>>>
>>>> E.g., you would make the following substitutions using sed for the
>>>> sample text from page 11:
>>>>
>>>> s/Å/Ā/g
>>>> s/å/ā/g
>>>> s/®/ṛ/g
>>>> s/ß/ṣ/g
>>>> s/∫/ṇ/g
>>>> s/î/ī/g
>>>> s/Ê/Ī/g
>>>> s/¸/Ś/g
>>>> s/Ω/ś/g
>>>> s/ü/ū/g
>>>>
>>>> Also attaching sed script as a utf-8 text file.
>>>>
>>>> Shree Devi Kumar
>>>> ____________________________________________________________
>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>
>>>>
>>>> On Wed, Nov 27, 2013 at 3:45 PM, V S Rawat <[email protected]> wrote:
>>>>
>>>>> Those Ā á characters are defined in the Garamond font, but the character
>>>>> codes used in this document are not the ones Garamond assigns to them.
>>>>>
>>>>> So it must be some other font in which these codes have been defined
>>>>> for these characters.
>>>>>
>>>>> The document lists a dozen fonts; it might be one of them. You need to
>>>>> figure out which font it is by trial and error.
>>>>>
>>>>> Thanks.
>>>>> --
>>>>> Rawat
>>>>>
>>>>>
>>>>> On 11/27/2013 3:17 PM, Jaanus Henno wrote:
>>>>>
>>>>>> OK, you can try page 11. There is a glossary there with lots of words
>>>>>> with diacritics. Thanks.
>>>>>>
>>>>>>
>>>>>> On Wed, Nov 27, 2013 at 4:41 PM, V S Rawat <[email protected]> wrote:
>>>>>>
>>>>>>
>>>>>>     "words with sanskrit transliteration marks are used"
>>>>>>
>>>>>>     Could you please point out the exact pages where I should look for
>>>>>>     them? I will try to OCR them and see the results.
>>>>>>
>>>>>>     Also,
>>>>>>     http://www.omkarananda-ashram.org/Sanskrit/itranslator99.htm#downloads
>>>>>>
>>>>>>     The above page and several pages linked from it also have a lot of
>>>>>>     Sanskrit fonts. Maybe one of them will be of use to you.
>>>>>>
>>>>>>     Thanks.
>>>>>>     --
>>>>>>     Rawat
>>>>>>
>>>>>>
>>>>>>     On 11/27/2013 9:16 AM, Srivas wrote:
>>>>>>
>>>>>>         Hi Rawat!
>>>>>>
>>>>>>         I'm really sorry, I didn't know that this is a mailing-list type
>>>>>>         of forum ;-(
>>>>>>
>>>>>>         Second, if you look carefully, you will see that the text is not
>>>>>>         entirely English. In many places words with Sanskrit
>>>>>>         transliteration marks are used. But as you said, the text can
>>>>>>         actually be copied and pasted, and that didn't even come to my
>>>>>>         mind! So this part is actually working, and that is great! So I
>>>>>>         am almost there. The remaining problem is of another type. The
>>>>>>         provided tamalten font will display the marks, but I need to use
>>>>>>         another font to display the final document. It also contains the
>>>>>>         same diacritical marks but uses another encoding. But this might
>>>>>>         be a question for another person. I know the author of the
>>>>>>         fonts; I will ask him. Thanks for the help!
>>>>>>
>>>>>>         Btw., if anyone needs fonts for transliterated Sanskrit, here
>>>>>>         are the resources:
>>>>>>         http://www.krishna-das.com/ksyberspace/fonts/
>>>>>>
>>>>>>         On Tuesday, November 26, 2013 4:47:11 PM UTC+7, V S Rawat wrote:
>>>>>>
>>>>>>              Dear Sir Srivas ji,
>>>>>>
>>>>>>              Firstly, you should not have sent a 2.2 MB, 68-page PDF
>>>>>>              file and a 181 KB zip to all the list members unasked. You
>>>>>>              could have uploaded them somewhere and sent the link, so
>>>>>>              that only those who can contribute download them. It is a
>>>>>>              waste of time and bandwidth to receive such huge messages.
>>>>>>
>>>>>>              Secondly, I couldn't really understand your issue. I looked
>>>>>>              at your PDF file. It is pure English. You can open it in
>>>>>>              any PDF reader, copy the entire text from there, and paste
>>>>>>              it into a text or Word file. So what else exactly are you
>>>>>>              looking for? Please elaborate.
>>>>>>
>>>>>>              You don't even need to OCR it. It is already ASCII text.
>>>>>>
>>>>>>              Thanks.
>>>>>>              --
>>>>>>              Rawat
>>>>>>
>>>>>>
>>>>>>              On 11/26/2013 12:40 PM, Srivas wrote:
>>>>>>               > Hi!
>>>>>>               > I have a bunch of PDF files (journals) and I need to get
>>>>>>               > the text out of them. They contain a lot of romanized
>>>>>>               > Sanskrit diacritical marks, and that creates a
>>>>>>               > difficulty. I tried FineReader and OmniPage, but they
>>>>>>               > cannot be trained to recognize those symbols. I just
>>>>>>               > need an OCR program that I can train to recognize any
>>>>>>               > symbol required, and the above programs cannot do that.
>>>>>>               >
>>>>>>               > Where should I start? I feel like this program can do
>>>>>>               > the job, but can you help me get started? I downloaded
>>>>>>               > tesseract and installed it (Windows). There are
>>>>>>               > different GUIs available, and I think one will make it
>>>>>>               > easier to work. Can you suggest a good one? I tried
>>>>>>               > gimagereader, but it's too primitive and leaves a lot
>>>>>>               > of work to be done afterwards on the overall text.
>>>>>>               >
>>>>>>               > I don't think this kind of language pack is available,
>>>>>>               > so how would I create it?
>>>>>>               >
>>>>>>               > I will add one PDF and the fonts that were used to
>>>>>>               > create it. Maybe someone would like to try and let me
>>>>>>               > know how to do it?
>>>>>>               >
>>>>>>               > Thank you for any help!
>>>>>>               >
>>>>>>               > Regards,
>>>>>>               > Srivas
>>>>>>
>>>>>>
>>>>> --
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To post to this group, send email to [email protected]
>>>>>
>>>>> To unsubscribe from this group, send email to
>>>>> [email protected]
>>>>>
>>>>> For more options, visit this group at
>>>>> http://groups.google.com/group/tesseract-ocr?hl=en
>>>>>
>>>>> --- You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to [email protected].
>>>>>
>>>>> For more options, visit https://groups.google.com/groups/opt_out.
>>>>>
>>>>
>>>
>>
