Unsubscribe please

On Tuesday, 27 January 2015, 19:39, Peter Murray-Rust <pm...@cam.ac.uk> wrote:
thomas...@gmail.com wrote...
>>> I have a requirement to read a Tamil PDF document and store the content in a db. When I read the document using PDFBox, the characters are junk and not readable. I suppose the problem is with the fonts used. Can you help me to resolve this?

It sounds as if you have a font which is not Unicode compliant and probably undocumented. We encounter a similar problem in scientific documents, where characters with high Unicode code points are represented by a variety of non-standard fonts. In PDF2SVG (http://bitbucket.org/petermr/pdf2svg; which is built on top of PDFBox) we try to provide debugging output and a variety of kludges to translate to Unicode.

There are several messy ways in which characters are transmitted in non-Unicode fashion:
* outline glyphs (i.e. the vectors representing the fonts)
* bitmapped glyphs with names or code points

The glyphs are usually supplied with the fonts. They are referenced either by non-Unicode code points or by names. There is no algorithmic way of solving the problem. If the font is in common use you *may* be able to find a translation table to Unicode by searching the web or asking, but in our experience this is uncommon. You may, if you are lucky, find a table of glyph images mapped onto code points or names.

If you have a large document or many documents it will be necessary to create a translation table. We do this in PDF2SVG on a heuristic basis - sometimes the characters have a sequence that maps onto the alphabet or Unicode, but sometimes it's completely arbitrary. There could be other horrors - such as different code points for the same character at different sizes (Microsoft does this for maths).

I know nothing of Tamil - I have read http://en.wikipedia.org/wiki/Tamil_script - but assuming your legacy font is systematic, you will need to map these Unicode tables onto your font. Assuming you don't have a translation table, you will have to do this manually, character by character.

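To make the translation-table idea concrete, here is a minimal Python sketch. It is not how PDF2SVG does it, and the legacy codes in the table are invented for illustration: a real table has to be built by comparing the codes that text extraction returns against the glyphs in the rendered PDF. The names `LEGACY_TO_UNICODE`, `sequential_table` and `translate` are hypothetical.

```python
# Hypothetical legacy-font table. Keys are the "junk" characters that text
# extraction returns; values are the Tamil characters those glyphs draw.
# These pairs are made up for illustration, not taken from any real font.
LEGACY_TO_UNICODE = {
    "m": "\u0b85",  # TAMIL LETTER A
    "M": "\u0b86",  # TAMIL LETTER AA
    "f": "\u0b95",  # TAMIL LETTER KA
    "k": "\u0bae",  # TAMIL LETTER MA
}


def sequential_table(first_legacy, first_unicode, count):
    """Build entries for a run of legacy codes that follows Unicode order.

    This covers the lucky case where a stretch of the font is systematic;
    arbitrary codes still have to be entered one by one.
    """
    return {
        chr(ord(first_legacy) + i): chr(ord(first_unicode) + i)
        for i in range(count)
    }


def translate(text, table):
    """Map each extracted character through the table.

    Returns the translated text plus the set of characters that had no
    entry, so the table can be extended by hand.
    """
    unmapped = set()
    out = []
    for ch in text:
        if ch in table:
            out.append(table[ch])
        elif ch.isspace():
            out.append(ch)  # keep whitespace unchanged
        else:
            unmapped.add(ch)
            out.append("\ufffd")  # mark the gap with U+FFFD
    return "".join(out), unmapped
```

Running `translate` over the extracted text and looking at the returned set of unmapped characters gives a finite to-do list: each entry is one more glyph to identify and add to the table.
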
Assuming you can reliably recognize Tamil characters, you can visually map the glyphs in the rendered PDF onto Unicode. Alternatively you could print the characters to screen and use an Optical Character Recognition program. We are (slowly) developing this for mathematics and other symbols, but not for Tamil. You might find that a good OCR program is the best way forward.

None of this will be huge fun, I am afraid - but the task is finite if there is only one font.

--
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069