Not as far as I am aware, no, there are no classes designed to recover these characters. Outside of the Paragraph, CharacterRun and TextPiece classes, I do not think anything else is available to recover the contents of the document. Over the weekend, I should have a few minutes to mess around with a bit of code - it looks like being too wet to work outside so I will be confined to the workshop and office where there are plenty of opportunities to take a tea break!! Should I manage to find anything then I will drop you a message.
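In the meantime, one cheap way to see what those classes actually hand back is to dump the code points of whatever text a CharacterRun returns. What follows is only a minimal sketch in plain Java; the class RunTextInspector and its method names are my own invention and are not part of POI. It assumes you already have the run's text as a String. I also believe, but have not verified, that Word can store characters inserted from the Symbol font in the Unicode private use range, so the sketch flags that case as well.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, NOT part of POI: inspects text already extracted from a run.
class RunTextInspector {

    // Describes every code point in the text as "U+XXXX (UNICODE_BLOCK)".
    static List<String> describe(String text) {
        List<String> out = new ArrayList<>();
        text.codePoints().forEach(cp ->
            out.add(String.format("U+%04X (%s)", cp, Character.UnicodeBlock.of(cp))));
        return out;
    }

    // True if at least one code point lies in the Greek block (pi, sigma, epsilon...).
    static boolean containsGreek(String text) {
        return text.codePoints()
                   .anyMatch(cp -> Character.UnicodeBlock.of(cp) == Character.UnicodeBlock.GREEK);
    }

    // True if at least one code point lies in the private use area.
    // Assumption: Symbol-font characters may turn up here rather than as real Greek letters.
    static boolean containsPrivateUse(String text) {
        return text.codePoints()
                   .anyMatch(cp -> Character.UnicodeBlock.of(cp) == Character.UnicodeBlock.PRIVATE_USE_AREA);
    }

    public static void main(String[] args) {
        // Stand-in for the text of a character run: pi, sigma, a plain letter.
        String sample = "\u03C0\u03C3x";
        for (String line : describe(sample)) {
            System.out.println(line);
        }
        System.out.println("Greek present: " + containsGreek(sample));
        System.out.println("Private use present: " + containsPrivateUse(sample));
    }
}
```

If the Greek letters show up as ordinary Unicode code points then the text ought to be safe to hand on as it is; if they show up in the private use range, or as plain Latin letters, then the font name of the run would probably be needed to work out which symbol was meant.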
In addition, I will test OpenOffice. Some time ago, I had to write a Java class that would, amongst other things, convert Word documents to PDF format using OpenOffice. I will run it against a document containing lots of those 'special' characters and see what happens. Further, if I can, I will add a method to recover a paragraph from the document - one containing symbols - and see what it recovers.

PS I hope you ignored the drivel at the end of my last message. With iText in mind, my brain automatically thinks of Rich Text Format files as I have used the API to manipulate them. RTF is marked up text, PDF - of course - is not. Brain fade, sorry.


nikhil n-2 wrote:
>
> So, are there any classes which can retrieve these types of chars from
> the doc file? Sorry for the late reply.
>
> On Thu, Apr 9, 2009 at 12:47 AM, MSB <[email protected]> wrote:
>
>>
>> Yes, I know the sort of thing you mean now - when using Word I remember
>> having the option to open a complicated looking dialog box that allowed
>> me to insert characters like the copyright and trademark symbols. I
>> would have expected that if they could be placed into a Word document
>> then they are encoded somewhere and available to us. My only doubts
>> here surround Word's use of Unicode - if it uses Unicode then
>> everything should be OK.
>>
>> Also, I made another discovery tonight whilst playing with some code.
>> If you remember my previous post, I got the CharacterRun(s) from the
>> document's high level Range object. This does not have to be the case.
>> You can do something like this;
>>
>> // Needs org.apache.poi.hwpf.HWPFDocument, the Range, Paragraph and
>> // CharacterRun classes from org.apache.poi.hwpf.usermodel, and
>> // java.io.File and java.io.FileInputStream.
>> HWPFDocument doc = new HWPFDocument(new FileInputStream(new
>> File("C:\\temp\\test.doc")));
>> Range range = doc.getRange();
>> int numParagraphs = range.numParagraphs();
>> // Walk the document paragraph by paragraph...
>> for(int i = 0; i < numParagraphs; i++) {
>>     Paragraph para = range.getParagraph(i);
>>     int numCharRuns = para.numCharacterRuns();
>>     // ...and then run by run within each paragraph.
>>     for(int j = 0; j < numCharRuns; j++) {
>>         CharacterRun charRun = para.getCharacterRun(j);
>>         ..........
>>     }
>> }
>>
>> That would allow you to create new paragraphs in the pdf file when you
>> need to - if I remember correctly, pdf files contain marked up text
>> organised into paragraphs with the /par tag - and build each from the
>> contents of the character runs.
>>
>>
>> nikhil n-2 wrote:
>> >
>> > Thanks a lot sir for all the information. Chars that may be present
>> > in an equation in a research paper are Greek letters like pi, sigma,
>> > epsilon etc. They can be created in a Microsoft Word document as it
>> > provides options to insert such chars. But my doubt is how can I
>> > retrieve those chars from the doc file by using HWPF. Even if I am
>> > successful in retrieving them, I should be able to write them in a
>> > pdf file using iText. Once again, thank you.
>> >
>> > On Wed, Apr 8, 2009 at 9:01 PM, MSB <[email protected]> wrote:
>> >
>> >>
>> >> Thanks for the reply, I understand what you are after a little
>> >> better now.
>> >>
>> >> As far as I am aware, formatting information is not exposed by the
>> >> Paragraph class but by the CharacterRun -
>> >> org.apache.poi.hwpf.usermodel.CharacterRun - class. By no means am I
>> >> an expert but I think that as the Word document is parsed by HWPF,
>> >> if and when the formatting applied to a piece of text changes then
>> >> it - the text - will be encapsulated within an instance of the
>> >> CharacterRun class. That class provides methods that allow you to
>> >> get at the colour of the text, the name and size of the font used,
>> >> and so on. To get at the CharacterRun(s) in the document you would
>> >> do something like this;
>> >>
>> >> HWPFDocument doc = new HWPFDocument(new FileInputStream(new
>> >> File("C:\\temp\\test.doc")));
>> >> Range range = doc.getRange();
>> >> // The character runs are counted and fetched through the Range.
>> >> int numCharRuns = range.numCharacterRuns();
>> >> CharacterRun charRun = null;
>> >> for(int i = 0; i < numCharRuns; i++) {
>> >>     charRun = range.getCharacterRun(i);
>> >> }
>> >>
>> >> Then once you have the CharacterRun, you should be able to
>> >> interrogate that object for lots of information - have a look at the
>> >> javadoc to see all of the available methods. After obtaining the
>> >> info, you ought to be able to use iText to create the pdf file for
>> >> you. My only concern is whether working through the document in this
>> >> manner will allow you to accurately re-create it using iText; I
>> >> guess that only a test will tell us this.
>> >>
>> >> The reason I asked about the nature of the research paper was that I
>> >> wanted to get some idea of the sort of characters that are included.
>> >> Forgive me please as I am 'mathematically challenged' and do not
>> >> know the terms to describe the sort of operators found in
>> >> mathematical expressions, but I feared that we may be dealing with
>> >> those - knowing that the research paper is plain text removes that
>> >> fear.
>> >>
>> >> Have a run with this and see how it works for you - I hope it may be
>> >> able to return some of the characters you were not seeing before. If
>> >> not, we may need to look at other options. Should this fail again,
>> >> is it possible for you to let me have a copy - assuming there is no
>> >> proprietary information contained within it that should not be seen
>> >> by anyone outside of your institution - of the sort of document you
>> >> are working with? That way, I can experiment with it myself; for
>> >> example, I have OpenOffice on my PC and NetBeans configured so that
>> >> I can create and run applications that use Universal Network Objects
>> >> (OpenOffice's API).
>> >>
>> >>
>> >> nikhil n-2 wrote:
>> >> >
>> >> > Hi,
>> >> >
>> >> > I am new to HWPF. I am working on a project where I am supposed to
>> >> > read a research paper in IEEE format from a doc file and convert
>> >> > it into a pdf file in a customized format.
>> >> > To do that I need to know the font size variations in the text. I
>> >> > am unable to read chars like pi, sigma etc. present in equations.
>> >> >
>> >> > Thank you.
>> >> >
>> >>
>> >> --
>> >> View this message in context:
>> >> http://www.nabble.com/font-styles-and-equations-in-word-doc-tp22927872p22953001.html
>> >> Sent from the POI - User mailing list archive at Nabble.com.
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: [email protected]
>> >> For additional commands, e-mail: [email protected]
>> >>
>> >
>>
>> --
>> View this message in context:
>> http://www.nabble.com/font-styles-and-equations-in-word-doc-tp22927872p22957496.html
>>
>

--
View this message in context:
http://www.nabble.com/font-styles-and-equations-in-word-doc-tp22927872p22998479.html
