Re: Per Page Document Content

Nick Burch Wed, 15 Jul 2015 03:05:53 -0700

On Wed, 15 Jul 2015, Nazar Hussain wrote:

> Yes, in the first phase I am targeting PDF and DOC files. Later I will add PPT and others, but all would be page-based documents.

.doc is not a page-based format, it's a run-based format. There is no page information in the file format; it's calculated on the fly when the file is rendered, based on fonts, print settings, etc.

> I have read in several references on the web that it returns a div per page. Can anyone help with exact code that works with Tika 1.9?

I'd suggest trying with the Tika App first, and using that to see what the XHTML looks like.


Then, follow something like these two examples:
http://tika.apache.org/1.9/examples.html#Parsing_to_XHTML
http://tika.apache.org/1.9/examples.html#Fetching_just_certain_bits_of_the_XHTML
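
As a rough illustration of the second example's idea, here is a minimal sketch of a SAX handler that collects the text found inside each <div class="page"> element, which is how Tika's PDF parser marks pages in its XHTML output. PageTextHandler and splitPages are hypothetical names made up for this sketch, not part of Tika, and splitPages feeds the handler from the JDK's own SAX parser only so the code runs without Tika on the classpath. Check the Tika App output first to confirm that your documents really produce page divs.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not part of Tika): gathers the text of each <div class="page">.
class PageTextHandler extends DefaultHandler {
    private final List<String> pages = new ArrayList<>();
    private StringBuilder current; // non-null only while inside a page div
    private int depth;             // nesting depth of <div> elements within the current page

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String name = localName.isEmpty() ? qName : localName;
        if (!"div".equals(name)) {
            return;
        }
        if (current == null) {
            // Assumption: pages are marked as <div class="page">, as in Tika's PDF output.
            if ("page".equals(atts.getValue("class"))) {
                current = new StringBuilder();
                depth = 1;
            }
        } else {
            depth++; // a nested div inside the current page
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String name = localName.isEmpty() ? qName : localName;
        if (current != null && "div".equals(name) && --depth == 0) {
            pages.add(current.toString().trim());
            current = null;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    List<String> getPages() {
        return pages;
    }

    // Stand-in for Tika: parses an XHTML string with the JDK's SAX parser.
    static List<String> splitPages(String xhtml) throws Exception {
        PageTextHandler handler = new PageTextHandler();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getPages();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><div class=\"page\"><p>First page</p></div>"
                + "<div class=\"page\"><p>Second page</p></div></body></html>";
        System.out.println(splitPages(xhtml));
    }
}
```

With Tika on the classpath, the same handler can be passed straight to the parser instead of going through splitPages, along the lines of new AutoDetectParser().parse(stream, handler, new Metadata(), new ParseContext()), and the pages read back with handler.getPages(). For a .doc file the list would likely come back empty, since that format carries no page information to turn into divs.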

Nick
