Vim does not run sed itself, but its command line accepts the same style of substitute command.
Type : in Vim (when not in insert mode).
The cursor will move to the bottom line, where you can type a sed-style substitute command and press Enter to run it on the text, for example :%s/Å/Ā/g for the whole file. You can also give a line range on which it should operate, and add the c flag (:%s/Å/Ā/gc) to confirm each replacement, a feature that I guess is not possible when replacing with a sed script.
I hope I am saying it correctly.
Thanks.
--
Rawat
On 11/28/2013 9:31 AM, Jaanus Henno wrote:
How do you run sed on Vim?
On Thu, Nov 28, 2013 at 12:53 AM, V S Rawat <[email protected]> wrote:
Yes, Srivas ji's file is 100% text, not images, and the text is 100% extractable to a Word/text file by simple copy and paste. OCR is just not needed.
So it is good that sed will make the changes without the need for OCR. Good thought.
I use Vim on Windows 8, so I wouldn't downgrade to sed. He he, just kidding. Vim has sed-style substitution built in. :-)
Thanks.
--
Rawat
On 11/27/2013 9:50 PM, Shree Devi Kumar wrote:
Rawatji,
I was going by the assumption that the text can be easily extracted from his PDF by saving it as txt. In that case, just running the sed script will fix the letters with diacritics, which were mapped to other letters in the ASCII font.
OCR never gives a 100% correct result, so using the OCR output and postprocessing it may not be the best solution in this case.
You could try the Windows version of sed from
http://gnuwin32.sourceforge.net/packages/sed.htm
I only tested using one paragraph of text from page 11.
Shree
Shree Devi Kumar
______________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
On Wed, Nov 27, 2013 at 9:29 PM, V S Rawat <[email protected]> wrote:
That is a very convenient solution, Shree Devi ji.
However, if sed or other "substitutors" are not available, or if one wants to avoid using them, I think it can be done using the built-in post-processing method of tesseract.
Use san.DangAmbigs.txt or hin.DangAmbigs.txt, whichever language you are using, then put the substitutions in it as
Å=Ā
one per line.
Would it work equally well and automatically, without needing a manual step?
If so, Shree Devi ji, is there any major benefit to post-processing in sed?
Please remind me where this DangAmbigs file is to be put.
Thanks.
--
Rawat
On 11/27/2013 6:50 PM, Shree Devi Kumar wrote:
I think rather than try to OCR, please extract the text and then run a conversion script to change the letters with diacritical marks.
E.g., you would do the following substitutions using sed for the sample text from page 11:
s/Å/Ā/g
s/å/ā/g
s/®/ṛ/g
s/ß/ṣ/g
s/∫/ṇ/g
s/î/ī/g
s/Ê/Ī/g
s/¸/Ś/g
s/Ω/ś/g
s/ü/ū/g
Also attaching sed script as a utf-8 text file.
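For readers without sed at hand, the same substitutions can be sketched in Python. This is only an illustration of the mapping listed above, not the attached script: the names LEGACY_TO_UNICODE and fix_diacritics are made up for this example, and it assumes the extracted text uses exactly the ten legacy characters shown (the table would need extending for other pages).

```python
import unicodedata

# Legacy-font character -> intended Unicode letter, copied from the
# sed substitutions above. Escapes are used so each key is exactly one
# code point; the comment shows the visible characters.
LEGACY_TO_UNICODE = {
    "\u00c5": "\u0100",  # Å -> Ā
    "\u00e5": "\u0101",  # å -> ā
    "\u00ae": "\u1e5b",  # ® -> ṛ
    "\u00df": "\u1e63",  # ß -> ṣ
    "\u222b": "\u1e47",  # ∫ -> ṇ
    "\u00ee": "\u012b",  # î -> ī
    "\u00ca": "\u012a",  # Ê -> Ī
    "\u00b8": "\u015a",  # ¸ -> Ś
    "\u03a9": "\u015b",  # Ω -> ś (assumed to be Greek capital omega)
    "\u00fc": "\u016b",  # ü -> ū
}

_TABLE = str.maketrans(LEGACY_TO_UNICODE)


def fix_diacritics(text: str) -> str:
    """Replace legacy-font characters with the intended diacritics."""
    # NFC first, so decomposed forms (A + combining ring) and the ohm
    # sign are folded into the single code points used as keys above.
    return unicodedata.normalize("NFC", text).translate(_TABLE)
```

Typical use would be to read the text saved from the PDF as UTF-8, pass it through fix_diacritics, and write the result back out. Because none of the replacement letters is itself a key in the table, doing all replacements in one pass gives the same result as running the sed commands one after another.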
Shree Devi Kumar
__________________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
On Wed, Nov 27, 2013 at 3:45 PM, V S Rawat <[email protected]> wrote:
Those Ā á characters are defined in the Garamond font, but the ASCII codes used in this document are not the same as those defined in Garamond. So it is some other font in which these ASCII codes have been defined for these characters.
The document lists a dozen fonts; it might be one of them. You need to figure out which font it is by trial and error.
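One way to narrow down the trial and error is to list which non-ASCII characters actually occur in the text copied out of the PDF, and how often. The sketch below is only an aid for that; the function name non_ascii_inventory is made up for this example, and it assumes the copied text has already been saved and read in as a Python string.

```python
from collections import Counter
import unicodedata


def non_ascii_inventory(text: str) -> list[tuple[str, str, int]]:
    """Return (character, Unicode name, count) for every non-ASCII
    character in text, most frequent first."""
    counts = Counter(ch for ch in text if ord(ch) > 127)
    return [
        (ch, unicodedata.name(ch, "UNKNOWN"), n)
        for ch, n in counts.most_common()
    ]
```

Each character reported this way is a candidate for a wrongly mapped diacritic, and can be compared against the printed page to decide what it should become.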
Thanks.
--
Rawat
On 11/27/2013 3:17 PM, Jaanus Henno wrote:
Ok, you can try page 11. There is a glossary and lots of words with diacritics. Thanks.
On Wed, Nov 27, 2013 at 4:41 PM, V S Rawat <[email protected]> wrote:
"words with sanskrit transliteration marks are used"
Could you please point out the exact pages where to look for them? I will try to OCR them and see the results.
Also,
http://www.omkarananda-ashram.org/Sanskrit/itranslator99.htm#downloads
The above page and several links from that page also have a lot of Sanskrit fonts. Maybe one of them is of use to you.
Thanks.
--
Rawat
On 11/27/2013 9:16 AM, Srivas wrote:
Hi Rawat!
I'm really sorry, I didn't know that this is a mailing list type of forum ;-(
Second, if you look carefully, you will see that the text is not entirely English. In many places words with Sanskrit transliteration marks are used. But as you said, it can actually be copy/pasted, and that didn't even come to my mind! So this part is actually working, and that is great! So I am almost there. The remaining problem is of another type.
The provided tamalten font will display the marks, but I need to use another font to display the final document. It also contains the same diacritical marks but uses another encoding. But this might be a question for another person; I know the author of the fonts, so I will ask him. Thanks for the help!
Btw, if anyone needs to use Sanskrit transliterated fonts, here are the resources:
http://www.krishna-das.com/ksyberspace/fonts/
On Tuesday, November 26, 2013 4:47:11 PM UTC+7, V S Rawat wrote:
Dear Sir Srivas ji,
Firstly, you should not have sent a 2.2 MB, 68-page PDF file and a 181 KB zip to all the list members unasked. You could have uploaded them somewhere and sent the link, so that only those who can contribute download them. It is a waste of time and bandwidth to get such huge messages.
Secondly, I couldn't really understand your issue. I saw your PDF file. It is pure English. You can open it in any PDF reader, copy the entire text from there, and paste it into a text or Word file. So what else exactly are you looking for? Please elaborate.
You don't even need to OCR it. It is already ASCII text.
Thanks.
--
Rawat
On 11/26/2013 12:40 PM, Srivas wrote:
> Hi!
> I have a bunch of PDF journal files and I need to get the text out of them. They contain a lot of romanized Sanskrit diacritical marks, and that creates a difficulty. I tried FineReader and OmniPage, but they cannot be trained to recognize those symbols. I just need an OCR program I can train to show any symbol required, and the above programs cannot do that.
>
> Where should I start? I feel like this program can do the job, but can you help me get started? I downloaded tesseract and installed it (Windows). There are different GUIs available, and I think one will make it easier to work. Can you suggest a good one? I tried gimagereader, but it's too primitive and leaves a lot of work to be done afterwards on the overall text.
>
> I don't think this kind of language pack is available, so how do I create it?
>
> I will add one PDF and the fonts that were used to create it. Maybe someone would like to try and let me know how to do it?
>
> Thank you for any help!
>
> Regards,
> Srivas
--
--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en