How do you run sed in Vim?

On Thu, Nov 28, 2013 at 12:53 AM, V S Rawat <[email protected]> wrote:

> Yes, Srivas ji's file is 100% text, not images, and is 100% extractable
> to a word/text file by simple copy-paste. OCR is just not needed.
>
> So it is good that sed will make the changes without any need for OCR.
> Good thought.
>
> I use vim on w8 so, I wouldn't downgrade to sed. he he. just kidding. vim
> has sed built in. :-)
>
> Thanks.
> --
> Rawat
>
> On 11/27/2013 9:50 PM, Shree Devi Kumar wrote:
>
>> Rawatji,
>>
>> I was going by the assumption that the text can be easily extracted from
>> his pdf by saving as txt. In that case just running the sed script will
>> fix the text for the letters with diacritics which were mapped to some
>> other letters in the ascii font.
>>
>> Doing OCR never gives a 100% correct result, so using the OCR output and
>> postprocessing it may not be the best solution in this case.
>>
>> You could try the Windows version of sed from
>> http://gnuwin32.sourceforge.net/packages/sed.htm
>>
>> I only tested using one para of text from page 11.
>>
>> Shree
>>
>> Shree Devi Kumar
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>>
>> On Wed, Nov 27, 2013 at 9:29 PM, V S Rawat <[email protected]> wrote:
>>
>>     That is a very convenient solution, Shree Devi ji.
>>
>>     However, if sed or other "substitutors" are not there, or if one
>>     wants to avoid using them, I think it can be done using the built-in
>>     post-processing method of tesseract.
>>
>>     Use san.DangAmbigs.txt or hin.DangAmbigs.txt, whichever language you
>>     are using.
>>
>>     Then put the substitutions in it as
>>     Å=Ā
>>     one per line.
>>
>>     Would it work equally well and automatically, without needing a
>>     manual step?
>>
>>     If so, then, Shree Devi ji, is there any major benefit of
>>     post-processing in sed?
>>
>>     Please remind me where this DangAmbigs file is to be put?
>>
>>     Thanks.
>>     --
>>     Rawat
>>
>>
>>     On 11/27/2013 6:50 PM, Shree Devi Kumar wrote:
>>
>>         I think rather than try to OCR, please extract the text and then
>>         run a conversion script to change the letters with diacritical
>>         marks.
>>
>>         E.g. you would do the following substitutions using sed for the
>>         sample text from page 11:
>>
>>         s/Å/Ā/g
>>         s/å/ā/g
>>         s/®/ṛ/g
>>         s/ß/ṣ/g
>>         s/∫/ṇ/g
>>         s/î/ī/g
>>         s/Ê/Ī/g
>>         s/¸/Ś/g
>>         s/Ω/ś/g
>>         s/ü/ū/g
>>
>>         Also attaching sed script as a utf-8 text file.
>>
>>         Shree Devi Kumar
>>         ______________________________________________________________
>>
>>         भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>>
>>         On Wed, Nov 27, 2013 at 3:45 PM, V S Rawat <[email protected]> wrote:
>>
>>              Those Ā á characters are defined in the Garamond font, but the
>>              ASCII codes used in this document are not the same as those
>>              defined in the Garamond font.
>>
>>              So it is some other font in which these ASCII codes have been
>>              defined for these characters.
>>
>>              The document lists a dozen fonts; it might be one of them.
>>              You need to figure out which font it could be by trial and
>>              error.
>>
>>              Thanks.
>>              --
>>              Rawat
>>
>>
>>              On 11/27/2013 3:17 PM, Jaanus Henno wrote:
>>
>>                  Ok, you can try page 11. There is a glossary and lots of
>>                  words with diacritics. Thanks.
>>
>>
>>                  On Wed, Nov 27, 2013 at 4:41 PM, V S Rawat <[email protected]> wrote:
>>
>>
>>                       "words with sanskrit transliteration marks are used"
>>
>>                       Could you please point out the exact pages where to
>>                       look for it? I will try to ocr it and see the results.
>>
>>                       Also,
>>                       http://www.omkarananda-ashram.org/Sanskrit/itranslator99.htm#downloads
>>
>>                       The above page and several links from that page also
>>                       have a lot of Sanskrit fonts. Maybe one of them might be
>>                       of use to you.
>>
>>                       Thanks.
>>                       --
>>                       Rawat
>>
>>
>>                       On 11/27/2013 9:16 AM, Srivas wrote:
>>
>>                           Hi Rawat!
>>
>>                           I'm really sorry, I didn't know that this is a
>>                           mailing list type of forum ;-(
>>
>>                           Second, if you look carefully, you will see that
>>                           the text is not entirely english. In many places
>>                           words with sanskrit transliteration marks are used.
>>                           But as you said, it can actually be copy/pasted and
>>                           it didn't even come to my mind! So this part is
>>                           actually working and that is great! So I am almost
>>                           there. The remaining problem is of another type.
>>                           The provided tamalten font will display the marks,
>>                           but I need to use another font to display the final
>>                           document. It also contains the same diacritical
>>                           marks but uses another encoding. But this might be
>>                           a question for another person. I know the author of
>>                           the fonts, I will ask him. Thanks for the help!
>>
>>                           Btw. If anyone needs to use sanskrit transliterated
>>                           fonts, here are the resources:
>>                           http://www.krishna-das.com/ksyberspace/fonts/
>>
>>                           On Tuesday, November 26, 2013 4:47:11 PM UTC+7, V S Rawat wrote:
>>
>>                                Dear Sir Srivas ji,
>>
>>                                Firstly, you should not have sent a 2.2 MB,
>>                                68-page pdf file and a 181 KB zip to all the
>>                                list members unasked. You could have uploaded
>>                                it somewhere and sent the link, so that only
>>                                those who can contribute download it. It is a
>>                                waste of time and bandwidth to get such huge
>>                                messages.
>>
>>                                Secondly, I couldn't really understand your
>>                                issue. I saw your pdf file. It is pure English.
>>                                You can open it in any pdf reader and just copy
>>                                the entire text from there and paste it into a
>>                                text or word file. So, what else exactly are
>>                                you looking for? Please elaborate.
>>
>>                                You don't even need to ocr it. This is already
>>                                ASCII text.
>>
>>                                Thanks.
>>                                --
>>                                Rawat
>>
>>
>>                                On 11/26/2013 12:40 PM, Srivas wrote:
>>                                 > Hi!
>>                                 > I have a bunch of PDF files (journals) and I
>>                                 > need to get the text out of them. They
>>                                 > contain a lot of romanized sanskrit
>>                                 > diacritical marks and that creates a
>>                                 > difficulty. I tried Finereader and OmniPage
>>                                 > but they cannot be trained to recognize those
>>                                 > symbols. I just need an OCR program I can
>>                                 > train to show any symbol required and the
>>                                 > above programs cannot do that.
>>                                 >
>>                                 > Where should I start from? I feel like this
>>                                 > program can do the job but can you help me to
>>                                 > get started? I downloaded tesseract and
>>                                 > installed it (windows). There are different
>>                                 > GUIs available and I think it will make it
>>                                 > easier to work. Can you suggest a good one? I
>>                                 > tried gimagereader but it's too primitive and
>>                                 > leaves a lot of work to be done afterwards
>>                                 > with the overall text.
>>                                 >
>>                                 > I don't think this kind of language pack is
>>                                 > available, so how do I create it?
>>                                 >
>>                                 > I will add one pdf and the fonts that were
>>                                 > used to create it. Maybe someone would like to
>>                                 > try and let me know how to do it?
>>                                 >
>>                                 > Thank you for any help!
>>                                 >
>>                                 > Regards,
>>                                 > Srivas
>>
>>
>>
> --
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to [email protected]
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en
>
> --- You received this message because you are subscribed to a topic in the
> Google Groups "tesseract-ocr" group.
> To unsubscribe from this topic, visit https://groups.google.com/d/
> topic/tesseract-ocr/6uG7HUxLY7w/unsubscribe.
> To unsubscribe from this group and all its topics, send an email to
> [email protected].
> For more options, visit https://groups.google.com/groups/opt_out.
>
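
For anyone without sed to hand, the substitutions quoted above can be
sketched in a few lines of Python. This is only a rough illustration: the
mapping is copied from the sed commands in the thread, which were tested on
a single paragraph from page 11, so it is almost certainly incomplete. The
names `SUBSTITUTIONS` and `fix_diacritics` are made up for this example.
In Vim itself the quoted commands should work as they are when typed as ex
commands over the whole buffer, e.g. `:%s/Å/Ā/g`.

```python
# Pairs copied from the sed script quoted in this thread (page 11 sample).
# Left: character that copy/paste from the PDF produces.
# Right: the letter with the diacritic that was intended.
# Assumption: this list is incomplete; other pages may need more pairs.
SUBSTITUTIONS = [
    ("Å", "Ā"),
    ("å", "ā"),
    ("®", "ṛ"),
    ("ß", "ṣ"),
    ("∫", "ṇ"),
    ("î", "ī"),
    ("Ê", "Ī"),
    ("¸", "Ś"),
    ("Ω", "ś"),
    ("ü", "ū"),
]


def fix_diacritics(text):
    """Apply each substitution in order, like sed's s/old/new/g lines."""
    for wrong, right in SUBSTITUTIONS:
        text = text.replace(wrong, right)
    return text


def convert_file(src, dst):
    """Read src as UTF-8, fix the letters, and write the result to dst."""
    with open(src, encoding="utf-8") as infile:
        text = infile.read()
    with open(dst, "w", encoding="utf-8") as outfile:
        outfile.write(fix_diacritics(text))
```

Both files are assumed to be UTF-8, the same as the attached sed script;
a call would look like `convert_file("page11.txt", "page11-fixed.txt")`,
where both file names are placeholders.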
