@Matt: I am looking for plain-text extraction, with no CSS or XPath
involved. I just want to extract the text of each page, so that I end up
with an array of plain-text strings in which each index holds the content
of a single page.

@Nick: I have made progress with the links you shared. My working handler
class is now:

class PageContentHandler < ToXMLContentHandler
  attr_accessor :page_tag
  attr_accessor :page_number
  attr_accessor :page_class

  def initialize
    @page_number = 0
    @page_tag = 'div'
    @page_class = 'page'
  end

  def startElement(uri, local_name, q_name, atts)
    # A new page starts at each <div class="page"> element.
    start_page if @page_tag == q_name && atts.getValue('class') == @page_class
  end

  def endElement(uri, local_name, q_name)
    end_page() if @page_tag == q_name
  end

  def start_page
    @page_number = @page_number + 1
  end

  def end_page
    return
  end
end

and the code I am using is:

parser = AutoDetectParser.new
handler = PageContentHandler.new
parser.parse(input_stream, handler, @metadata_java, ParseContext.new)

If I use

puts handler.page_number

it prints the correct number of pages in the document.

Now, how can I get at the text inside each individual page using that
content handler?
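
To make the goal concrete, here is a rough sketch of the per-page
accumulation, written in plain Ruby with hand-fed SAX-style events standing
in for Tika. The class name PageTextCollector and its snake_case methods are
hypothetical and not part of Tika. In a real handler the same logic would
presumably go into startElement, endElement and the SAX
characters(ch, start, length) callback, which I have not tested. The depth
counter is there because endElement fires for every closing div, not only
for the page div.

```ruby
# Plain-Ruby sketch: collects text per page from SAX-style callbacks.
# PageTextCollector is a hypothetical name, not a Tika class.
class PageTextCollector
  attr_reader :pages

  def initialize(page_tag = 'div', page_class = 'page')
    @page_tag = page_tag
    @page_class = page_class
    @pages = []       # one String per page
    @depth = 0        # nested page_tag elements open inside the current page
    @in_page = false
  end

  # attrs is a plain Hash here; under Tika it would be a SAX Attributes object.
  def start_element(q_name, attrs = {})
    if !@in_page && q_name == @page_tag && attrs['class'] == @page_class
      @in_page = true
      @depth = 0
      @pages << ''.dup          # start a fresh, mutable buffer for this page
    elsif @in_page && q_name == @page_tag
      @depth += 1               # a nested div inside the page
    end
  end

  def end_element(q_name)
    return unless @in_page && q_name == @page_tag

    if @depth.zero?
      @in_page = false          # this closes the page div itself
    else
      @depth -= 1               # this closes a nested div
    end
  end

  # Text outside any page div is ignored.
  def characters(text)
    @pages.last << text if @in_page
  end
end

collector = PageTextCollector.new
collector.start_element('div', 'class' => 'page')
collector.characters('First page. ')
collector.start_element('div', 'class' => 'annotation')
collector.characters('A nested div.')
collector.end_element('div')
collector.end_element('div')
collector.characters('ignored')
collector.start_element('div', 'class' => 'page')
collector.characters('Second page.')
collector.end_element('div')

p collector.pages
# prints ["First page. A nested div.", "Second page."]
```

The result is the array I described above: collector.pages[0] holds the
text of the first page, collector.pages[1] the second, and so on.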

Regards
Nazar Hussain

On Wed, 15 Jul 2015 at 18:19 Mattmann, Chris A (3980) <
[email protected]> wrote:

> Also, Nazar, are you talking about e.g., Scrapy style extractions?
> If so, Tika has the Content Handler interface. From Java, this is
> relatively easy to call, but we don’t really provide a mechanism
> from the command line and/or REST server to call arbitrary extractions.
> Maybe we should think about doing that, guys?
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: [email protected]
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>
>
>
> -----Original Message-----
> From: Nick Burch <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Wednesday, July 15, 2015 at 1:59 AM
> To: "[email protected]" <[email protected]>
> Subject: Re: Per Page Document Content
>
> >On Wed, 15 Jul 2015, Nazar Hussain wrote:
> >> The problem I am facing is with pages. I can extract total pages from
> >> document metadata. But I can't find any way to extract content per page
> >> from the document.
> >
> >What file formats is this for? And how are you calling Tika?
> >
> >If the file format is page-based, eg PDF or PPT, then the html you get
> >back should have each page separated, IIRC by a div per page
> >
> >If the file format isn't a page-based one, and no page information is
> >available in the file, then there won't be page information in the HTML
> >as Tika isn't able to render the document to spot page breaks.
> >
> >Nick
>
>
