Thanks for the suggestions. The original PDF is 75 MB, which is why I attached only a single page.
I have been able to pre-process the images so that the grainy areas become black. My son applied a Gaussian blur and then changed the black level to 150% in Photoshop.

I plan to add a few of those images to my Sanskrit training data so that the character shapes match the typeface of the book, and I will share that traineddata.

Shree

Shree Devi Kumar
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sat, Aug 24, 2013 at 9:09 AM, Sriranga(79yrs) <[email protected]> wrote:

> Shree,
> Better to upload the original PDF set without pre-processing, and also the
> traineddata file, since it failed in FreeOCR as well as gimagereader when
> tested using san.traineddata. This is the first time I have faced this.
> With blessings,
> sriranga(79yrs)
>
> On Sat, Aug 24, 2013 at 12:08 AM, Sven Pedersen <[email protected]> wrote:
>
>> The white areas within the characters in the PNG version are likely to
>> confuse tesseract about the character shapes. Perhaps you can do something
>> to improve that? I think someone has posted methods for dealing with that
>> recently.
>> --Sven
>>
>> On Fri, Aug 23, 2013 at 9:08 AM, Shree Devi Kumar <[email protected]> wrote:
>>
>>> I want to OCR a Sanskrit book available as a PDF.
>>>
>>> I used gsview to save all pages as PNG and then used scantailor to
>>> deskew the images, which saved them as TIFs. Then I used irfanview to
>>> apply blur and median filters, as the text is very grainy in the
>>> original, and also resized the pages to a smaller size.
>>>
>>> The image pre-processed as above gives a better result than the original.
>>>
>>> I would like to know if there is a simpler/better method to pre-process
>>> the images. The PDF is 500+ pages.
>>>
>>> I am attaching a single page from the PDF and the processed image file.
>>>
>>> Thanks,
>>> Shree
>> --
>> ``All that is gold does not glitter,
>> not all those who wander are lost;
>> the old that is strong does not wither,
>> deep roots are not reached by the frost.
>> From the ashes a fire shall be woken,
>> a light from the shadows shall spring;
>> renewed shall be blade that was broken,
>> the crownless again shall be king.”

--
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [email protected]
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en
---
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
For more options, visit https://groups.google.com/groups/opt_out.
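
P.S. In case it helps anyone else on the list, here is a toy illustration of why the blur-then-darken step mentioned in this message fills the white specks inside characters. It is only a sketch in pure Python on a tiny hand-made patch, not what Photoshop or irfanview do internally. The 3x3 kernel, the `cutoff` value of 160 and all function names are my own assumptions for illustration; for real pages an imaging tool or library would be used instead.

```python
# Illustrative only: a pure-Python stand-in for "Gaussian blur, then darken".
# Pixels are grayscale ints: 0 = black ink, 255 = white paper.

# 3x3 approximation of a Gaussian kernel; the weights sum to 16.
KERNEL = [[1, 2, 1],
          [2, 4, 2],
          [1, 2, 1]]


def gaussian_blur(img):
    """Blur a 2D list of grayscale values, clamping coordinates at the edges."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    total += KERNEL[dy + 1][dx + 1] * img[yy][xx]
            out[y][x] = total // 16
    return out


def darken(img, cutoff=160):
    """Map every pixel darker than `cutoff` to black and the rest to white.

    `cutoff` is an assumed value, not the Photoshop setting from the message.
    """
    return [[0 if p < cutoff else 255 for p in row] for row in img]


def clean(img, cutoff=160):
    """Blur first so isolated specks turn gray, then push gray to black."""
    return darken(gaussian_blur(img), cutoff)


# A 5x7 patch: a 3-pixel-thick horizontal stroke with one white speck in it.
W, B = 255, 0
patch = [
    [W, W, W, W, W, W, W],
    [B, B, B, B, B, B, B],
    [B, B, B, W, B, B, B],
    [B, B, B, B, B, B, B],
    [W, W, W, W, W, W, W],
]
cleaned = clean(patch)
```

After the blur, the isolated white speck is averaged with its black neighbours and ends up dark gray, so the cutoff turns it black, while the paper rows above and below the stroke stay light enough to remain white. A lower cutoff keeps strokes thinner but fills fewer specks; a higher one fills more but can fatten the characters, so the value would need tuning on a few real pages.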

