On Wed, 15 Jul 2015, Nazar Hussain wrote:
Yes, in the first phase I am targeting PDF and DOC files. Later I will use PPT and others, but all would be page-based documents.

.doc is not a page-based format; it's a run-based format. There is no page information in the file format. Pagination is calculated on the fly when the file is rendered, based on fonts, print settings, etc.

I have read in various references on the web that Tika returns a div per page. Can anyone help with exact code that works with Tika 1.9?

I'd suggest trying the Tika App first, and using that to see what the XHTML looks like.

Then, follow something like these two examples:
http://tika.apache.org/1.9/examples.html#Parsing_to_XHTML
http://tika.apache.org/1.9/examples.html#Fetching_just_certain_bits_of_the_XHTML
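
To give a rough idea of the second example, here is a minimal sketch of a SAX handler that collects the text of each `<div class="page">` separately. The class name `PageSplitHandler` and the `splitPages` helper are made up for illustration, and the sketch assumes the XHTML really does wrap each page in a div with class "page" (worth confirming against the Tika App output for your files; it will not hold for .doc). The demo below feeds hand-written XHTML through the JDK's own SAX parser so it runs without Tika. With Tika on the classpath, the same handler could presumably be passed to `new AutoDetectParser().parse(stream, handler, new Metadata(), new ParseContext())` instead.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: collects the text of each <div class="page"> as one page. */
public class PageSplitHandler extends DefaultHandler {
    private final List<String> pages = new ArrayList<>();
    private StringBuilder current; // null while outside a page div
    private int depth;             // div nesting depth inside the current page

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String name = localName.isEmpty() ? qName : localName;
        if (!"div".equals(name)) {
            return;
        }
        if (current == null) {
            // Assumption: a page starts at a div whose class attribute is "page".
            if ("page".equals(atts.getValue("class"))) {
                current = new StringBuilder();
                depth = 1;
            }
        } else {
            depth++; // nested div inside the current page
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String name = localName.isEmpty() ? qName : localName;
        if (current != null && "div".equals(name) && --depth == 0) {
            pages.add(current.toString().trim());
            current = null;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text outside any page div is ignored.
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    public List<String> getPages() {
        return pages;
    }

    /** Runs the handler over an XHTML string using only the JDK SAX parser. */
    public static List<String> splitPages(String xhtml) throws Exception {
        PageSplitHandler handler = new PageSplitHandler();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getPages();
    }

    public static void main(String[] args) throws Exception {
        String sample = "<html><body>"
            + "<div class=\"page\"><p>First page</p></div>"
            + "<div class=\"page\"><p>Second page</p></div>"
            + "</body></html>";
        List<String> pages = splitPages(sample);
        for (int i = 0; i < pages.size(); i++) {
            System.out.println("Page " + (i + 1) + ": " + pages.get(i));
        }
    }
}
```

Note that text from adjacent paragraphs on a page is concatenated without a separator here; a real handler would probably want to add a newline at the end of each `p` element.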

Nick
