Yes, in the first phase I am targeting PDF and DOC files. Later I will add PPT
and others, but all of them will be page-based documents.

I have read in several references on the web that Tika returns a div per page.
Can anyone help with exact code that works with Tika 1.9?

I have this code, written in JRuby:

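# Assumes BodyContentHandler has already been imported, e.g.
# java_import 'org.apache.tika.sax.BodyContentHandler'
class MyContentHandler < BodyContentHandler
  attr_accessor :page_tag
  attr_accessor :page_number

  def initialize
    super() # construct the underlying Java BodyContentHandler
    @page_number = 0
    @page_tag = 'div'
  end

  # Called for every opening tag; counts each element named page_tag.
  def start_element(uri, local_name, q_name, atts)
    start_page if @page_tag == q_name
  end

  # Called for every closing tag.
  def end_element(uri, local_name, q_name)
    end_page if @page_tag == q_name
  end

  def start_page
    @page_number += 1
  end

  # Placeholder: nothing is done at the end of a page yet.
  def end_page
  end
end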

and I am using it as:

parser = AutoDetectParser.new
handler = MyContentHandler.new
parser.parse(input_stream, handler, @metadata_java, ParseContext.new)

It runs without any error. But how can I get the content per page?
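One possible direction, as a minimal sketch rather than working Tika code: buffer the text seen between the start and end of each page div. The sketch below is plain Ruby so it can be run without Tika. The PageCollector name is hypothetical, the attributes are a plain Hash (with real SAX they would be an org.xml.sax.Attributes object), characters takes a String instead of a char array with offsets, and it assumes each page arrives as a div with class="page". Whether DOC output carries such a marker would still need checking.

```ruby
# Plain-Ruby sketch: collects text per page from SAX-style callbacks.
# Assumption: each page is delivered as <div class="page"> ... </div>.
class PageCollector
  attr_reader :pages

  def initialize(page_tag: 'div', page_class: 'page')
    @page_tag = page_tag
    @page_class = page_class
    @pages = []   # finished page texts, in document order
    @buffer = nil # text of the page being read; nil outside a page
    @depth = 0    # nesting depth of page_tag elements in the current page
  end

  # atts is a plain Hash here (an assumption made for this sketch).
  def start_element(_uri, _local_name, q_name, atts)
    if @buffer
      # Already inside a page: track nested elements with the same tag name.
      @depth += 1 if q_name == @page_tag
    elsif q_name == @page_tag && atts['class'] == @page_class
      @buffer = +''
      @depth = 1
    end
  end

  # Appends text only while a page is open.
  def characters(text)
    @buffer << text if @buffer
  end

  def end_element(_uri, _local_name, q_name)
    return unless @buffer && q_name == @page_tag

    @depth -= 1
    return unless @depth.zero?

    # The page div itself has closed: store its text and reset.
    @pages << @buffer.strip
    @buffer = nil
  end
end

# Hand-fed events standing in for what a parser would emit.
collector = PageCollector.new
collector.start_element('', 'div', 'div', { 'class' => 'page' })
collector.start_element('', 'p', 'p', {})
collector.characters('First page text')
collector.end_element('', 'p', 'p')
collector.end_element('', 'div', 'div')
collector.start_element('', 'div', 'div', { 'class' => 'page' })
collector.start_element('', 'div', 'div', { 'class' => 'annotation' })
collector.characters('Second page text')
collector.end_element('', 'div', 'div')
collector.end_element('', 'div', 'div')

collector.pages.each_with_index do |text, index|
  puts "Page #{index + 1}: #{text}"
end
```

The depth counter is there so that a nested div inside a page does not close the page early. Wiring the same logic into a handler passed to parser.parse would also need the events forwarded or handled in the Java-facing methods, which this sketch does not attempt.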

Regards
Nazar Hussain


On Wed, 15 Jul 2015 at 14:00 Nick Burch <[email protected]> wrote:

> On Wed, 15 Jul 2015, Nazar Hussain wrote:
> > The problem I am facing is with pages. I can extract total pages from
> > document metadata. But I can't find any way to extract content per page
> > from the document.
>
> What file formats is this for? And how are you calling Tika?
>
> If the file format is page-based, eg PDF or PPT, then the html you get
> back should have each page separated, IIRC by a div per page
>
> If the file format isn't a page-based one, and no page information is
> available in the file, then there won't be page information in the HTML as
> Tika isn't able to render the document to spot page breaks.
>
> Nick
>
