Sorry Mark, my knowledge of XWPF is VERY limited indeed. It may be best to
start a new thread asking specifically about XWPF if you want a better
response. Having said that, and speaking as someone with limited knowledge
of such things, would it not be possible to transform the XML inside the
docx file into 'your' XML format and simply remove XWPF from the equation
entirely?
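
To make that suggestion a little more concrete, here is a rough sketch using
only the JDK. It assumes the docx file is a zip archive whose main body lives
in the part word/document.xml, which is the usual layout but worth checking
against your own files. The stylesheet and the <doc>/<para> output elements
are purely made up for illustration; you would substitute your own format. I
have not tried it against complex documents (text boxes, for instance, may
cause text to be repeated), so treat it as a starting point only.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.StringReader;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class DocxToCustomXml {

    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical stylesheet: each w:p becomes <para style="...">text</para>.
    // The style attribute is empty when the paragraph names no style.
    static final String XSLT =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:w='" + W_NS + "' exclude-result-prefixes='w'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/'>"
      + "<doc><xsl:apply-templates select='//w:p'/></doc>"
      + "</xsl:template>"
      + "<xsl:template match='w:p'>"
      + "<para style='{w:pPr/w:pStyle/@w:val}'>"
      + "<xsl:for-each select='.//w:t'><xsl:value-of select='.'/></xsl:for-each>"
      + "</para>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // A tiny stand-in for word/document.xml, used when no file is given.
    static final String SAMPLE =
        "<w:document xmlns:w='" + W_NS + "'><w:body>"
      + "<w:p><w:pPr><w:pStyle w:val='Heading1'/></w:pPr>"
      + "<w:r><w:t>Title</w:t></w:r></w:p>"
      + "<w:p><w:r><w:t>Body text</w:t></w:r></w:p>"
      + "</w:body></w:document>";

    // Applies the stylesheet to a stream of WordprocessingML.
    static String transform(InputStream documentXml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSLT)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(documentXml), new StreamResult(out));
        return out.toString();
    }

    // Opens the docx as a zip and transforms its main document part.
    static String transformDocx(String docxPath) throws Exception {
        try (ZipFile zip = new ZipFile(docxPath)) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IllegalArgumentException("No word/document.xml in " + docxPath);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return transform(in);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length > 0) {
            System.out.println(transformDocx(args[0]));
        } else {
            byte[] bytes = SAMPLE.getBytes(StandardCharsets.UTF_8);
            System.out.println(transform(new ByteArrayInputStream(bytes)));
        }
    }
}
```

Run it with the path to a docx file as the only argument, or with no
arguments to see what it does with the built-in sample. The style value it
reports is the style id stored in the paragraph, which is not always the
name Word shows in its user interface.
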
Yours
Mark B
markl16 wrote:
>
> During my research into HWPF I suddenly need to look into XWPF for
> reading in a docx file, going through the file line by line and getting
> style info so that an alternative custom XML document can be created based
> on the tags in the original docx file.
>
> Just wondering, does XWPF offer similar features to HWPF as discussed in
> this thread so far? I have read in a simple docx file and printed out the
> text but I don't see many methods to get details on style etc. I wonder
> whether it would be worthwhile investigating further.
>
> Best
> Mark
>
>
> MSB wrote:
>>
>> I think that my previous post may have been a bit misleading because when
>> you are parsing the document you are getting Paragraph objects, so I think
>> that the best thing to do is something like the following:
>>
>> if (paragraph instanceof ListEntry) {
>>     // We have a list, or at least an entry in a list.
>> }
>> else {
>>     // We can assume we have a Paragraph at this point, but I do not think
>>     // it is possible to check its type any further. The only thing we can
>>     // do now is to see whether the Paragraph is in a Table cell.
>>     // isInTable() is checked first because, as far as I can tell,
>>     // getTable() throws an IllegalArgumentException rather than
>>     // returning null when the Paragraph is not in a table.
>>     if (paragraph.isInTable()) {
>>         // We are dealing with text that is in a table cell, so we have
>>         // found a table that can be dealt with here.
>>         Table table = docRange.getTable(paragraph);
>>     }
>>     else {
>>         // We are dealing with a paragraph of text only now.
>>     }
>> }
>>
>> Sadly, I have not been able to find the code I wrote that strips tables
>> out 'in line', but I am sure it is on the list somewhere if you have a
>> good search through. Something like the above will allow you to detect
>> lists, tables and 'normal' paragraphs of text as they occur in the
>> document. Images are going to present a different problem, I suspect,
>> and I do not yet know how to approach it.
>>
>> One other aspect you may want to consider is sections. Typically, the
>> way I process a Word document is:
>>
>> Open the document.
>> Get the Range object for the document (which one depends upon whether I
>> want to process the headers/footers or not).
>> Ask that Range how many Paragraph objects it contains.
>> Iterate through the Paragraphs one at a time.
>>
>> It is possible - at least I think it is - to abstract this up one level,
>> so;
>>
>> Open the document.
>> Get the Range object for the document.
>> Ask it how many Sections it contains.
>> Iterate through the Sections and for each:
>>     Ask it how many Paragraphs it contains.
>>     Iterate through the Paragraphs.
>>
>> Sections contain some information that may/will be valuable to you, not
>> least being the number of columns on the page.
>>
>> Yours
>>
>> Mark B
>>
>
>