Re: font styles and equations in word doc

MSB Sat, 11 Apr 2009 03:52:37 -0700

..............but, I do not think that it matters.

Was having a cup of coffee this morning and put together some code that used
HWPF to get the Paragraphs from a Word document and simply dump the text to
screen. The Word document contained a VERY simple formula - to calculate the
area of a circle by multiplying the value of pi by the radius of the circle
squared - and I used the Insert>Symbol dialog to insert the pi and 'squared'
symbols. Dumping the text of that document to screen, I saw what I suspect
you had seen: the pi symbol was missing.
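
One way to check whether a symbol really is missing from the text, or is simply something the console cannot display, is to print the Unicode code points rather than the characters themselves. What follows is only a sketch under that assumption. The class name CodePointDump and the sample string are made up for illustration, the string stands in for whatever para.text() returns, and it needs Java 8 or later for String.codePoints().

```java
import java.util.StringJoiner;

// Hypothetical helper, not part of POI or iText.
final class CodePointDump {

    /** Returns every Unicode code point in the text as U+XXXX, separated by spaces. */
    static String describe(String text) {
        StringJoiner joiner = new StringJoiner(" ");
        text.codePoints().forEach(cp -> joiner.add(String.format("U+%04X", cp)));
        return joiner.toString();
    }

    public static void main(String[] args) {
        // Stand-in for para.text(): pi, the letter r, superscript two.
        String sample = "\u03C0r\u00B2";
        System.out.println(describe(sample));
    }
}
```

If the pi shows up as U+03C0 in that listing, the character survived the trip through HWPF and only the screen output is at fault.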


So, I created a bit more code that took the text read by HWPF from the Word
document and used it to create a new rtf document using iText.
Unsurprisingly, the symbols I had inserted into the Word document appeared
perfectly in the rtf document. This is the code I used, and you may like to
give it a go.

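    // Imports this method relies on, assuming iText 2.x (the com.lowagie
    // packages) and the POI scratchpad jar that contains HWPF. Note that
    // Paragraph here is the HWPF class, not the iText one.
    import java.io.BufferedInputStream;
    import java.io.BufferedOutputStream;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import org.apache.poi.hwpf.HWPFDocument;
    import org.apache.poi.hwpf.usermodel.Paragraph;
    import org.apache.poi.hwpf.usermodel.Range;
    import com.lowagie.text.Chunk;
    import com.lowagie.text.Document;
    import com.lowagie.text.rtf.RtfWriter2;

    public void dumpDoc(String inputFilename, String outputFilename) {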
        BufferedInputStream bufIStream = null;
        BufferedOutputStream bufOStream = null;
        FileInputStream fileIStream = null;
        FileOutputStream fileOStream = null;
        File inputFile = null;
        File outputFile = null;
        HWPFDocument doc = null;
        Range range = null;
        Paragraph para = null;
        int numParas = 0;
        
        RtfWriter2 rtfWriter = null;
        Document rtfDoc = null;
        Chunk chunk = null;

        try {
            outputFile = new File(outputFilename);
            fileOStream = new FileOutputStream(outputFile);
            bufOStream = new BufferedOutputStream(fileOStream);
            rtfDoc = new Document();
            rtfWriter = RtfWriter2.getInstance(rtfDoc, bufOStream);
            rtfDoc.open();
            
            inputFile = new File(inputFilename);
            fileIStream = new FileInputStream(inputFile);
            bufIStream = new BufferedInputStream(fileIStream);
            doc = new HWPFDocument(bufIStream);
            range = doc.getRange();
            numParas = range.numParagraphs();
            for(int i = 0; i < numParas; i++) {
                para = range.getParagraph(i);
                System.out.println("Paragraph number: " + i + " contains: "
                        + para.text());
                // showParaText is a helper method that is not shown in this post.
                this.showParaText(para.text());
                rtfDoc.add(new Chunk(para.text()));
            }
        }
        catch(Exception ex) {
            System.out.println("Caught an: " + ex.getClass().getName());
            System.out.println("Message: " + ex.getMessage());
            System.out.println("Stacktrace follows:..............");
            ex.printStackTrace(System.out);
        }
        finally {
            if(bufIStream != null) {
                try {
                    bufIStream.close();
                    bufIStream = null;
                    fileIStream = null;
                }
                catch(Exception ex) {
                    // I G N O R E //
                }
            }
            if(rtfWriter != null) {
                rtfWriter.flush();
                rtfWriter.close();
            }
        }
    }
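
If the symbols look wrong on screen, it may be because System.out uses the platform's default encoding. A minimal sketch, assuming that is the cause, is to wrap the stream in a PrintStream that encodes as UTF-8. Utf8Console and its methods are hypothetical names, not part of POI or iText, and the terminal itself must also be set up to display UTF-8 for this to make any visible difference.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

// Hypothetical helper, not part of POI or iText.
final class Utf8Console {

    /** Wraps any stream in an auto-flushing PrintStream that encodes text as UTF-8. */
    static PrintStream utf8(OutputStream target) {
        try {
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            // Every JVM is required to support UTF-8, so this should not happen.
            throw new IllegalStateException("UTF-8 is not available", ex);
        }
    }

    /** Returns the bytes the UTF-8 stream produces for the text; handy for checking. */
    static byte[] encode(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintStream out = utf8(buffer);
        out.print(text);
        out.flush();
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        // Use this stream in place of System.out when dumping paragraph text.
        PrintStream out = utf8(System.out);
        out.println("Area = \u03C0r\u00B2");
    }
}
```

In dumpDoc, the idea would be to create the stream once before the loop and call its println in place of System.out.println.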

When I get another few minutes, I will change the code to create a pdf using
iText and see what the result is. I hope it will show that we are worrying
about nothing.


nikhil n-2 wrote:
> 
> so,are there any classes which can retrieve these type of chars from the
> doc
> file.sorry for the late reply.
> 
> On Thu, Apr 9, 2009 at 12:47 AM, MSB <[email protected]> wrote:
> 
>>
>> Yes, I know the sort of thing you mean now - when using Word I remember
>> having the option to open a complicated looking dialog box that allowed
>> me to insert characters like the copyright and trademark symbols. I
>> would have expected that if they could be placed into a Word document
>> then they are encoded somewhere and available to us. My only doubts here
>> surround Word's use of Unicode - if it uses Unicode then everything
>> should be OK.
>>
>> Also, I made another discovery tonight whilst playing with some code. If
>> you remember my previous post, I got the CharacterRun(s) from the
>> document's high-level Range object. This does not have to be the case.
>> You can do something like this:
>>
>>
>> HWPFDocument doc = new HWPFDocument(new FileInputStream(new
>> File("C:\\temp\\test.doc")));
>> Range range = doc.getRange();
>> int numParagraphs = range.numParagraphs();
>> for(int i = 0; i < numParagraphs; i++) {
>>   Paragraph para = range.getParagraph(i);
>>   int numCharRuns = para.numCharacterRuns();
>>   for(int j = 0; j < numCharRuns; j++) {
>>      CharacterRun charRun = para.getCharacterRun(j);
>>      ..........
>>   }
>> }
>>
>> That would allow you to create new paragraphs in the pdf file when you
>> need to - if I remember correctly, pdf files contain marked-up text
>> organised into paragraphs with the /par tag - and build each from the
>> contents of the character runs.
>>
>>
>> nikhil n-2 wrote:
>> >
>> > Thanks a lot sir for all the information.chars that may be present in a
>> > equation in a research paper are greek letters like pi,sigma,epsilon
>> > etc.they can be created in a microsoft word document as it provides
>> > options
>> > to insert such chars.but my doubt is how can i retrieve those chars
>> from
>> > the
>> > doc file by using hwpf.even if i am successfull in retrieving,i should
>> be
>> > able to write them in a pdf file using itext.once again thank u.
>> >
>> > On Wed, Apr 8, 2009 at 9:01 PM, MSB <[email protected]> wrote:
>> >
>> >>
>> >> Thanks for the reply, I understand what you are after a little better
>> >> now.
>> >>
>> >> As far as I am aware, formatting information is not exposed by the
>> >> Paragraph
>> >> class but by the CharacterRun -
>> >> org.apache.poi.hwpf.usermodel.CharacterRun
>> >> -
>> >> class. By no means am I an expert but I think that as the Word
>> document
>> >> is
>> >> parsed by HWPF, if and when the formatting applied to a piece of text
>> >> changes then it - the text - will be encapsulated within an instance
>> of
>> >> the
>> >> CharacterRun class. That class provides methods that allow you to get
>> at
>> >> the
>> >> colour of the text, the name and size of the font used, and so on. To
>> get
>> >> at
>> >> the CharacterRun(s) in the document you would do something like this;
>> >>
>> >> HWPFDocument doc = new HWPFDocument(new FileInputStream(new
>> >> File("C:\\temp\\test.doc")));
>> >> Range range = doc.getRange();
>> >> int numCharRuns = range.numCharacterRuns();
>> >> CharacterRun charRun = null;
>> >> for(int i = 0; i < numCharRuns; i++) {
>> >>   charRun = range.getCharacterRun(i);
>> >> }
>> >>
>> >> Then once you have the CharacterRun, you should be able to interrogate
>> >> that
>> >> object for lots of information - have a look at the javadoc to see all
>> of
>> >> the available methods. After obtaining the info, you ought to be able
>> to
>> >> use
>> >> iText to create the pdf file for you. My only concern is whether
>> working
>> >> through the document in this manner will allow you to accurately
>> >> re-create
>> >> it using iText; I guess that only a test will tell us this.
>> >>
>> >> The reason I asked about the nature of the research paper was that I
>> >> wanted
>> >> to get some idea of the sort of characters that are included. Forgive
>> me
>> >> please as I am 'mathematically challenged' and do not know the terms to
>> >> describe the sort of operators found in mathematical expressions, but I
>> >> feared that we may be dealing with those - knowing that the research
>> >> paper
>> >> is plain text removes that fear.
>> >>
>> >> Have a run with this and see how it works for you - I hope it may be
>> able
>> >> to
>> >> return some of the characters you were not seeing before. If not, we
>> may
>> >> need to look at other options. Should this fail again, is it possible
>> for
>> >> you to let me have a copy - assuming there is no proprietary
>> information
>> >> contained within it that should not be seen by anyone outside of your
>> >> institution - of the sort of document you are working with? That way,
>> I
>> >> can
>> >> experiment with it myself; for example, I have OpenOffice on my PC and
>> >> NetBeans configured so that I can create and run applications that use
>> >> Universal Network Objects (OpenOffice's API).
>> >>
>> >>
>> >> nikhil n-2 wrote:
>> >> >
>> >> > hii,
>> >> >
>> >> > i am new to hwpf.i am working on a project where i am supposed to
>> read
>> >> a
>> >> > research paper in ieee format from a doc file and convert it into a
>> pdf
>> >> > file
>> >> > in a customized format.
>> >> > to do that i need to know the font size variations in the text.i am
>> >> unable
>> >> > to read char's like pi,sigma etc present in equations.
>> >> >
>> >> > thank u.
>> >> >
>> >> >
>> >>
>> >> --
>> >> View this message in context:
>> >>
>> http://www.nabble.com/font-styles-and-equations-in-word-doc-tp22927872p22953001.html
>> >> Sent from the POI - User mailing list archive at Nabble.com.
>> >>
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: [email protected]
>> >> For additional commands, e-mail: [email protected]
>> >>
>> >>
>> >
>> >
>>
>> --
>> View this message in context:
>> http://www.nabble.com/font-styles-and-equations-in-word-doc-tp22927872p22957496.html
>> Sent from the POI - User mailing list archive at Nabble.com.
>>
>>
>>
>>
> 
> 

-- 
View this message in context: 
http://www.nabble.com/font-styles-and-equations-in-word-doc-tp22927872p22999987.html
Sent from the POI - User mailing list archive at Nabble.com.

