Rawatji,

I was assuming that the text can be extracted easily from his PDF by
saving it as txt. In that case, just running the sed script will fix the
letters with diacritics, which were mapped to other characters in the
ASCII font.
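
As a rough sketch of that step, assuming the extracted text and the script
are both saved as UTF-8: the file names diacritics.sed, extracted.txt and
fixed.txt are placeholders of mine, and the two sample words are made up
for illustration, not taken from the PDF. The substitutions are the ones
from the page 11 sample.

```shell
# Write the substitutions from the page 11 sample to a sed script file.
# All file names here are placeholders, not names used in the thread.
cat > diacritics.sed <<'EOF'
s/Å/Ā/g
s/å/ā/g
s/®/ṛ/g
s/ß/ṣ/g
s/∫/ṇ/g
s/î/ī/g
s/Ê/Ī/g
s/¸/Ś/g
s/Ω/ś/g
s/ü/ū/g
EOF

# A short made-up sample standing in for text saved from the PDF.
printf '%s\n' 'K®ß∫a' 'Åtmå' > extracted.txt

# Apply every substitution in the script and write the corrected text.
sed -f diacritics.sed extracted.txt > fixed.txt
cat fixed.txt
```

The sed -f line is the part that matters. The list covers only the
characters seen in that one sample, so other pages may need more entries.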

OCR never gives a 100% correct result, so using the OCR output and
post-processing it may not be the best solution in this case.

You could try the Windows version of sed from
http://gnuwin32.sourceforge.net/packages/sed.htm

I only tested it on one paragraph of text from page 11.

Shree

Shree Devi Kumar
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


On Wed, Nov 27, 2013 at 9:29 PM, V S Rawat <[email protected]> wrote:

> That is a very convenient solution, Shree Devi ji.
>
> However, if sed or other "substitutors" are not there, or if one wants to
> avoid using them, I think it can be done using built in post-processing
> method of tesseract.
>
> Use san.DangAmbigs.txt or hin.DangAmbigs.txt, whichever language you are
> using.
>
> Then put the substitutions in it as
> Å=Ā
> one per line.
>
> Would it work equally well and automatically, without needing a manual step?
>
> If so, then, Shree Devi ji, is there any major benefit to post-processing
> with sed?
>
> Could you please remind me where this DangAmbigs file is to be put?
>
> Thanks.
> --
> Rawat
>
>
> On 11/27/2013 6:50 PM, Shree Devi Kumar wrote:
>
>> I think that, rather than trying OCR, please extract the text and then run
>> a conversion script to change the letters with diacritical marks.
>>
>> E.g., you would do the following substitutions using sed for the sample
>> text from page 11:
>>
>> s/Å/Ā/g
>> s/å/ā/g
>> s/®/ṛ/g
>> s/ß/ṣ/g
>> s/∫/ṇ/g
>> s/î/ī/g
>> s/Ê/Ī/g
>> s/¸/Ś/g
>> s/Ω/ś/g
>> s/ü/ū/g
>>
>> Also attaching sed script as a utf-8 text file.
>>
>> Shree Devi Kumar
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>>
>> On Wed, Nov 27, 2013 at 3:45 PM, V S Rawat <[email protected]> wrote:
>>
>>     Those Ā á characters are defined in the Garamond font, but the ASCII
>>     codes used in this document are not the same as those defined in
>>     Garamond.
>>
>>     So, it must be some other font in which these ASCII codes have been
>>     defined for these characters.
>>
>>     The document lists a dozen fonts; one of them might be it. You need
>>     to figure out which font it could be by trial and error.
>>
>>     Thanks.
>>     --
>>     Rawat
>>
>>
>>     On 11/27/2013 3:17 PM, Jaanus Henno wrote:
>>
>>         Ok, you can try page 11. There is glossary and lots of words with
>>         diacritics. Thanks.
>>
>>
>>         On Wed, Nov 27, 2013 at 4:41 PM, V S Rawat <[email protected]> wrote:
>>
>>
>>              "words with sanskrit transliteration marks are used"
>>
>>              Could you please point out the exact pages where to look for
>>              it? I will try to OCR it and see the results.
>>
>>              Also,
>>              http://www.omkarananda-ashram.org/Sanskrit/itranslator99.htm#downloads
>>
>>              The above page and several pages linked from it also have a
>>              lot of Sanskrit fonts. Maybe one of them could be of use to you.
>>
>>              Thanks.
>>              --
>>              Rawat
>>
>>
>>              On 11/27/2013 9:16 AM, Srivas wrote:
>>
>>                  Hi Rawat!
>>
>>                  I'm really sorry, I didn't know that this is a mailing
>>                  list type of forum ;-(
>>
>>                  Second, if you look carefully, you will see that the text
>>                  is not entirely English. In many places words with Sanskrit
>>                  transliteration marks are used. But as you said, it can
>>                  actually be copy/pasted, and it didn't even come to my mind!
>>                  So this part is actually working and that is great! So I am
>>                  almost there. The remaining problem is of another type.
>>                  The provided tamalten font will display the marks, but I
>>                  need to use another font to display the final document. It
>>                  also contains the same diacritical marks but uses another
>>                  encoding. But this might be a question for another person.
>>                  I know the author of the fonts, so I will ask him. Thanks
>>                  for the help!
>>
>>                  Btw, if anyone needs to use Sanskrit transliteration fonts,
>>                  here are the resources:
>>                  http://www.krishna-das.com/ksyberspace/fonts/
>>
>>                  On Tuesday, November 26, 2013 4:47:11 PM UTC+7, V S Rawat wrote:
>>
>>                       Dear Sir Srivas ji,
>>
>>                       Firstly, you should not have sent a 2.2 MB, 68-page
>>                       PDF file and a 181 KB zip to all the list members
>>                       unasked. You could have uploaded them somewhere and
>>                       sent the link, so that only those who can contribute
>>                       download them. It is a waste of time and bandwidth to
>>                       get such huge messages.
>>
>>                       Secondly, I couldn't really understand your issue. I
>>                       saw your PDF file. It is pure English. You can open it
>>                       in any PDF reader, copy the entire text from there,
>>                       and paste it into a text or Word file. So, what else
>>                       exactly are you looking for? Please elaborate.
>>
>>                       You don't even need to OCR it. This is already ASCII
>>                       text.
>>
>>                       Thanks.
>>                       --
>>                       Rawat
>>
>>
>>                       On 11/26/2013 12:40 PM, Srivas wrote:
>>                        > Hi!
>>                        > I have a bunch of PDF journals and I need to get
>>                        > the text out of them. They contain a lot of
>>                        > romanized Sanskrit diacritical marks, and that
>>                        > creates a difficulty. I tried FineReader and
>>                        > OmniPage, but they cannot be trained to recognize
>>                        > those symbols. I just need an OCR program I can
>>                        > train to output any symbol required, and the above
>>                        > programs cannot do that.
>>                        >
>>                        > Where should I start? I feel like this program can
>>                        > do the job, but can you help me get started? I
>>                        > downloaded tesseract and installed it (Windows).
>>                        > There are different GUIs available and I think one
>>                        > will make it easier to work. Can you suggest a good
>>                        > one? I tried gimagereader, but it's too primitive
>>                        > and leaves a lot of work to be done afterwards on
>>                        > the overall text.
>>                        >
>>                        > I don't think this kind of language pack is
>>                        > available, so how do I create one?
>>                        >
>>                        > I will attach one PDF and the fonts that were used
>>                        > to create it. Maybe someone would like to try it and
>>                        > let me know how to do it?
>>                        >
>>                        > Thank you for any help!
>>                        >
>>                        > Regards,
>>                        > Srivas
>>
>
> --
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to [email protected]
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en
>
> --- You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/groups/opt_out.
>
