Re: font styles and equations in word doc

MSB Fri, 10 Apr 2009 23:47:33 -0700

Not as far as I am aware, no, there are no classes designed to recover these
characters. Outside of the Paragraph, CharacterRun and TextPiece classes, I
do not think anything else is available to recover the contents of the
document. Over the weekend, I should have a few minutes to mess around with
a bit of code - it looks like being too wet to work outside so I will be
confined to the workshop and office where there are plenty of opportunities
to take a tea break!! Should I manage to find anything then I will drop you
a message.


In addition, I will test OpenOffice. Some time ago, I had to write a Java
class that would, amongst other things, convert Word documents to pdf
format using OpenOffice. I will run it against a document containing lots of
those 'special' characters and see what happens. Further, if I can, I will
add a method to recover a paragraph from the document - one containing
symbols - and see what it recovers.
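
A rough sketch of the sort of check I have in mind, in plain Java with no POI
calls at all: hand it whatever text a Paragraph or CharacterRun gives back and
it lists every non-ASCII code point, so you can see whether pi, sigma and
friends survived the trip. The class and method names are my own invention for
illustration, and the idea that symbol-font characters might turn up in the
Unicode private use area is an assumption I have not verified against HWPF.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of POI: reports the non-ASCII characters in a string.
final class CodePointDump {

    // Returns one "U+XXXX label" entry for every code point above 0x7F, in order.
    static List<String> nonAscii(String text) {
        List<String> found = new ArrayList<>();
        text.codePoints()
            .filter(cp -> cp > 0x7F)
            .forEach(cp -> found.add(String.format("U+%04X %s", cp, label(cp))));
        return found;
    }

    // Uses the Unicode character name where there is one; private use code
    // points are flagged separately because they have no standard meaning
    // (assumption: symbol-font characters may be stored there).
    static String label(int cp) {
        if (Character.getType(cp) == Character.PRIVATE_USE) {
            return "(private use)";
        }
        String name = Character.getName(cp);
        return name == null ? "(unassigned)" : name;
    }

    public static void main(String[] args) {
        // Stand-in for text recovered from the document.
        String sample = "Area = \u03C0r\u00B2, sum = \u03A3x";
        nonAscii(sample).forEach(System.out::println);
    }
}
```

If the Greek letters show up here with their proper names then the text is
coming back as ordinary Unicode and the problem lies further down the line; if
all you get is private use entries or question marks, the characters are being
lost or remapped on the way out of the document.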

PS I hope you ignored the drivel at the end of my last message. With iText
in mind, my brain automatically thinks of Rich Text Format files as I have
used the API to manipulate them. RTF is marked up text, pdf - of course - is
not. Brain fade, sorry.


nikhil n-2 wrote:
> 
> So, are there any classes which can retrieve these types of characters from
> the doc file? Sorry for the late reply.
> 
> On Thu, Apr 9, 2009 at 12:47 AM, MSB <[email protected]> wrote:
> 
>>
>> Yes, I know the sort of thing you mean now - when using Word I remember
>> having the option to open a complicated looking dialog box that allowed me
>> to insert characters like the copyright and trademark symbols. I would have
>> expected that if they could be placed into a Word document then they are
>> encoded somewhere and available to us. My only doubts here surround Word's
>> use of Unicode - if it uses Unicode then everything should be OK.
>>
>> Also, I made another discovery tonight whilst playing with some code. If you
>> remember my previous post, I got the CharacterRun(s) from the document's
>> high-level Range object. This does not have to be the case. You can do
>> something like this:
>>
>>
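>> // Needs java.io.File, java.io.FileInputStream,
>> // org.apache.poi.hwpf.HWPFDocument, and Range, Paragraph and CharacterRun
>> // from org.apache.poi.hwpf.usermodel.
>> HWPFDocument doc = new HWPFDocument(new FileInputStream(new
>> File("C:\\temp\\test.doc")));
>> Range range = doc.getRange();
>> int numParagraphs = range.numParagraphs();
>> // Walk the paragraphs, then the character runs inside each paragraph.
>> for(int i = 0; i < numParagraphs; i++) {
>>   Paragraph para = range.getParagraph(i);
>>   int numCharRuns = para.numCharacterRuns();
>>   for(int j = 0; j < numCharRuns; j++) {
>>      CharacterRun charRun = para.getCharacterRun(j);
>>      ..........
>>   }
>> }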
>>
>> That would allow you to create new paragraphs in the pdf file when you need
>> to - if I remember correctly, pdf files contain marked-up text organised into
>> paragraphs with the /par tag - and build each from the contents of the
>> character runs.
>>
>>
>> nikhil n-2 wrote:
>> >
>> > Thanks a lot, sir, for all the information. Characters that may be present
>> > in an equation in a research paper are Greek letters like pi, sigma,
>> > epsilon etc. They can be created in a Microsoft Word document, as it
>> > provides options to insert such characters. My doubt is how I can retrieve
>> > those characters from the doc file using HWPF. Even if I am successful in
>> > retrieving them, I should be able to write them to a pdf file using iText.
>> > Once again, thank you.
>> >
>> > On Wed, Apr 8, 2009 at 9:01 PM, MSB <[email protected]> wrote:
>> >
>> >>
>> >> Thanks for the reply, I understand what you are after a little better
>> >> now.
>> >>
>> >> As far as I am aware, formatting information is not exposed by the
>> >> Paragraph class but by the CharacterRun -
>> >> org.apache.poi.hwpf.usermodel.CharacterRun - class. By no means am I an
>> >> expert but I think that as the Word document is parsed by HWPF, if and
>> >> when the formatting applied to a piece of text changes then it - the
>> >> text - will be encapsulated within an instance of the CharacterRun class.
>> >> That class provides methods that allow you to get at the colour of the
>> >> text, the name and size of the font used, and so on. To get at the
>> >> CharacterRun(s) in the document you would do something like this:
>> >>
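>> >> // Needs java.io.File, java.io.FileInputStream,
>> >> // org.apache.poi.hwpf.HWPFDocument, and Range and CharacterRun from
>> >> // org.apache.poi.hwpf.usermodel.
>> >> HWPFDocument doc = new HWPFDocument(new FileInputStream(new
>> >> File("C:\\temp\\test.doc")));
>> >> Range range = doc.getRange();
>> >> // The character runs belong to the Range, not to the document itself.
>> >> int numCharRuns = range.numCharacterRuns();
>> >> CharacterRun charRun = null;
>> >> for(int i = 0; i < numCharRuns; i++) {
>> >>   charRun = range.getCharacterRun(i);
>> >> }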
>> >>
>> >> Then once you have the CharacterRun, you should be able to interrogate
>> >> that object for lots of information - have a look at the javadoc to see
>> >> all of the available methods. After obtaining the info, you ought to be
>> >> able to use iText to create the pdf file for you. My only concern is
>> >> whether working through the document in this manner will allow you to
>> >> accurately re-create it using iText; I guess that only a test will tell
>> >> us this.
>> >>
>> >> The reason I asked about the nature of the research paper was that I
>> >> wanted to get some idea of the sort of characters that are included.
>> >> Forgive me please as I am 'mathematically challenged' and do not know the
>> >> terms to describe the sort of operators found in mathematical
>> >> expressions, but I feared that we may be dealing with those - knowing
>> >> that the research paper is plain text removes that fear.
>> >>
>> >> Have a run with this and see how it works for you - I hope it may be able
>> >> to return some of the characters you were not seeing before. If not, we
>> >> may need to look at other options. Should this fail again, is it possible
>> >> for you to let me have a copy - assuming there is no proprietary
>> >> information contained within it that should not be seen by anyone outside
>> >> of your institution - of the sort of document you are working with? That
>> >> way, I can experiment with it myself; for example, I have OpenOffice on
>> >> my PC and NetBeans configured so that I can create and run applications
>> >> that use Universal Network Objects (OpenOffice's API).
>> >>
>> >>
>> >> nikhil n-2 wrote:
>> >> >
>> >> > Hi,
>> >> >
>> >> > I am new to HWPF. I am working on a project where I am supposed to read
>> >> > a research paper in IEEE format from a doc file and convert it into a
>> >> > pdf file in a customized format.
>> >> > To do that I need to know the font size variations in the text. I am
>> >> > unable to read characters like pi, sigma etc. present in equations.
>> >> >
>> >> > Thank you.
>> >> >
>> >> >
>> >>
>> >> --
>> >> View this message in context:
>> >>
>> http://www.nabble.com/font-styles-and-equations-in-word-doc-tp22927872p22953001.html
>> >> Sent from the POI - User mailing list archive at Nabble.com.
>> >>
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: [email protected]
>> >> For additional commands, e-mail: [email protected]
>> >>
>> >>
>> >
>> >
>>
>> --
>> View this message in context:
>> http://www.nabble.com/font-styles-and-equations-in-word-doc-tp22927872p22957496.html
>> Sent from the POI - User mailing list archive at Nabble.com.
>>
>>
>>
>>
> 
> 

-- 
View this message in context: 
http://www.nabble.com/font-styles-and-equations-in-word-doc-tp22927872p22998479.html
Sent from the POI - User mailing list archive at Nabble.com.


