I think that my previous post may have been a bit misleading, because when you
are parsing the document you are getting Paragraph objects. So I think that
the best thing to do is something like the following:
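if (paragraph instanceof ListEntry) {
    // Then we have a list, or at least an entry in a list.
}
else {
    // We can assume we have a Paragraph at this point, but I do not think it
    // is possible to check the type any further. The only thing we can do
    // now is to see whether the Paragraph is in a table cell. As far as I
    // can tell, docRange.getTable() throws an IllegalArgumentException
    // rather than returning null for a paragraph that is not in a table,
    // so ask the paragraph itself first.
    if (paragraph.isInTable()) {
        // We are dealing with text that is in a table cell. Thus we have
        // found a table that can be dealt with here. It is safest to call
        // getTable() with the first paragraph of the table.
        Table table = docRange.getTable(paragraph);
    }
    else {
        // We are dealing with a paragraph of text only now.
    }
}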
Sadly, I have not been able to find the code I wrote that strips tables out
'in line' but I am sure it is on the list somewhere if you have a good
search through. Something like the above will allow you to detect lists,
tables and 'normal' paragraphs of text as they occur in the document; sadly
images are going to present a different problem I suspect and I do not as
yet know how to approach this particular problem.
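The order of the checks matters because ListEntry is a subclass of Paragraph, so the ListEntry test has to come first. Here is a minimal sketch of that dispatch which runs without POI or a Word document to hand. Please note that the Paragraph and ListEntry classes below are stand-ins I have made up to mimic the shape of the HWPF classes, and that Kind and classify() are hypothetical names; only isInTable() mirrors a method on POI's own Paragraph.

```java
// Stand-in types that only mimic the shape of the HWPF classes discussed
// above; they are NOT the POI classes themselves.
final class ParagraphKinds {

    /** Stand-in for a paragraph; POI's Paragraph also offers isInTable(). */
    static class Paragraph {
        private final boolean inTable;

        Paragraph(boolean inTable) {
            this.inTable = inTable;
        }

        boolean isInTable() {
            return inTable;
        }
    }

    /** Stand-in for a list entry, which is a kind of paragraph. */
    static class ListEntry extends Paragraph {
        ListEntry() {
            super(false);
        }
    }

    /** Hypothetical labels for the three cases described in this post. */
    enum Kind { LIST_ENTRY, TABLE_TEXT, PLAIN_TEXT }

    static Kind classify(Paragraph paragraph) {
        // Test the subclass first: every ListEntry is also a Paragraph, so
        // checking in the other order would never reach this branch.
        if (paragraph instanceof ListEntry) {
            return Kind.LIST_ENTRY;
        }
        // Not a list entry, so the only further question is the table flag.
        if (paragraph.isInTable()) {
            return Kind.TABLE_TEXT;
        }
        return Kind.PLAIN_TEXT;
    }
}
```

In real code the body of classify() would take POI's Paragraph instead, and each branch would do the list, table or plain text handling rather than return a label.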
One other aspect you may want to consider is sections. Typically, the way I
process a Word document is:
Open the document.
Get the Range object for the document (which one depends upon whether I want
to process the headers/footers or not).
Ask that Range how many Paragraph objects it contains.
Iterate through the Paragraphs one at a time.
It is possible - at least I think it is - to abstract this up one level, so:
Open the document.
Get the Range object for the document.
Ask it how many Sections it contains.
Iterate through the Sections and for each:
Ask it how many Paragraphs it contains.
Iterate through the Paragraphs.
Sections contain some information that may/will be valuable to you, not
least being the number of columns on the page.
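The section-by-section walk can be sketched as two nested count-then-index loops. Again this is only a sketch that runs on its own: Range, Section and Paragraph below are stand-ins I have made up, not the POI classes, and walk() is a hypothetical name. The method names do follow the HWPF ones (numSections(), getSection(), numParagraphs(), getParagraph(), getNumColumns(), text()), so the loops should carry over to a real document more or less unchanged.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-ins that mimic the count-then-index style of the HWPF Range API;
// they are NOT the POI classes themselves.
final class SectionWalk {

    static class Paragraph {
        private final String text;

        Paragraph(String text) {
            this.text = text;
        }

        String text() {
            return text;
        }
    }

    static class Section {
        private final int columns;
        private final List<Paragraph> paragraphs;

        Section(int columns, List<Paragraph> paragraphs) {
            this.columns = columns;
            this.paragraphs = paragraphs;
        }

        int getNumColumns() {
            return columns;
        }

        int numParagraphs() {
            return paragraphs.size();
        }

        Paragraph getParagraph(int index) {
            return paragraphs.get(index);
        }
    }

    static class Range {
        private final List<Section> sections;

        Range(List<Section> sections) {
            this.sections = sections;
        }

        int numSections() {
            return sections.size();
        }

        Section getSection(int index) {
            return sections.get(index);
        }
    }

    /** Visits every paragraph, section by section, noting the column count. */
    static List<String> walk(Range range) {
        List<String> lines = new ArrayList<>();
        for (int s = 0; s < range.numSections(); s++) {
            Section section = range.getSection(s);
            // Section-level information is available before the paragraphs.
            int columns = section.getNumColumns();
            for (int p = 0; p < section.numParagraphs(); p++) {
                Paragraph paragraph = section.getParagraph(p);
                lines.add("section " + s + ", " + columns + " column(s): "
                        + paragraph.text());
            }
        }
        return lines;
    }
}
```

The inner loop is where the list/table/plain text test on each paragraph would go.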
Yours
Mark B
markl16 wrote:
>
> Yep, I think you were on to something there. I tried:
> [code]
> if(paragraph instanceof ListEntry)
> {
> System.out.println("true");
> }
> else
> {
> System.out.println("false");
> }
> [/code]
> Which seemed to work. I'll do some more research and see whether a similar
> solution works for all the tags I want.
>
> Best
> Mark
>
>
> MSB wrote:
>>
>> I am hoping that it really is this simple but I cannot be too sure that
>> it really will be. The org.apache.poi.hwpf.usermodel.Range class is the
>> parent class for CharacterRun, DocumentPosition, Paragraph, Section,
>> Table and TableCell, whilst Paragraph is the parent of ListEntry. I have
>> never tried this but could it be as simple as using instanceof to test
>> what class you actually had in hand whilst parsing the document? It
>> should be easy enough to test this hypothesis;
>>
>> Open a document.
>> Get the top level Range object.
>> Get the number of Paragraphs.
>> Iterate through the Paragraphs one at a time and test to see what object
>> you actually have in hand.
>>
>> There are going to be one or two holes in this - I think that it will not
>> deal with pictures for example - but it could well be a way to start.
>>
>> Yours
>>
>> Mark B
>>
>