Yes, in the first phase I am targeting PDF and DOC files. Later I will add PPT
and other formats, but all of them will be page-based documents.
I have read in several places on the web that Tika returns a div per page. Can
anyone help with the exact code that works with Tika 1.9?
I have this code written in JRuby:

class MyContentHandler < BodyContentHandler
  attr_accessor :page_tag
  attr_accessor :page_number

  def initialize
    @page_number = 0
    @page_tag = 'div'
  end

  def start_element(uri, local_name, q_name, atts)
    start_page() if @page_tag == q_name
  end

  def end_element(uri, local_name, q_name)
    end_page() if @page_tag == q_name
  end

  def start_page
    @page_number = @page_number + 1
  end

  def end_page
    return
  end
end

and I am using it like this:

parser = AutoDetectParser.new
handler = MyContentHandler.new
parser.parse(input_stream, handler, @metadata_java, ParseContext.new)

It runs without any errors, but how can I get the content of each page?
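
My current guess is that the handler has to buffer the text it receives in
characters() between the start and the end of each page div. Below is a rough
sketch of that idea in plain Ruby, with the SAX events fed in by hand instead
of by Tika, so it is not wired to the parser at all. PageCollector and its
pages array are names I made up for illustration, and the check for
class="page" on the div is an assumption about what Tika emits that I have not
verified for every format.

```ruby
# Plain-Ruby sketch: collects text per page from SAX-style events.
# PageCollector is a hypothetical name, not part of Tika. In JRuby the same
# logic would live in a ContentHandler subclass whose callbacks Tika invokes.
class PageCollector
  attr_reader :pages

  # page_class = 'page' is an assumption about the markup of page divs.
  def initialize(page_tag = 'div', page_class = 'page')
    @page_tag = page_tag
    @page_class = page_class
    @pages = []    # finished page texts, in document order
    @buffer = nil  # text of the page being read; nil while outside a page
    @depth = 0     # nesting depth of page_tag elements inside the page
  end

  # attrs is a plain Hash here; with real SAX it would be an Attributes object.
  def start_element(name, attrs = {})
    if @buffer
      # Already inside a page: track nested divs so the matching end is found.
      @depth += 1 if name == @page_tag
    elsif name == @page_tag && attrs['class'] == @page_class
      @buffer = ''.dup
      @depth = 1
    end
  end

  # Text outside a page div is ignored.
  def characters(text)
    @buffer << text if @buffer
  end

  def end_element(name)
    return unless @buffer && name == @page_tag
    @depth -= 1
    return unless @depth.zero?
    @pages << @buffer.strip
    @buffer = nil
  end
end

# Hand-fed events standing in for what a parser would emit.
collector = PageCollector.new
collector.start_element('div', 'class' => 'page')
collector.start_element('p')
collector.characters('First page text')
collector.end_element('p')
collector.end_element('div')
collector.start_element('div', 'class' => 'page')
collector.characters('Second page text')
collector.end_element('div')

collector.pages.each_with_index do |text, i|
  puts "Page #{i + 1}: #{text}"
end
```

If this is the right direction, I would move the same buffering into
MyContentHandler, but I am not sure whether the class attribute is the correct
way to tell a page div apart from any other div.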
Regards
Nazar Hussain
On Wed, 15 Jul 2015 at 14:00 Nick Burch wrote:
> On Wed, 15 Jul 2015, Nazar Hussain wrote:
> > The problem I am facing is with pages. I can extract total pages from
> > document metadata. But I can't find any way to extract content per page
> > from the document.
>
> What file formats is this for? And how are you calling Tika?
>
> If the file format is page-based, eg PDF or PPT, then the html you get
> back should have each page separated, IIRC by a div per page
>
> If the file format isn't a page-based one, and no page information is
> available in the file, then there won't be page information in the HTML as
> Tika isn't able to render the document to spot page breaks.
>
> Nick
>