Nick,

I finally solved my problem. Here is my content handler class:

require 'java'
java_import 'java.lang.StringBuilder'
java_import 'org.apache.tika.sax.ToXMLContentHandler'

class PageContentHandler < ToXMLContentHandler
  attr_accessor :page_tag
  attr_accessor :page_number
  attr_accessor :page_class
  attr_accessor :page_map

  def initialize
    super()  # initialize the Java superclass
    @page_number = 0
    @page_tag = 'div'
    @page_class = 'page'
    @page_map = Hash.new
  end

  # A <div class="page"> marks the start of a new page.
  def startElement(uri, local_name, q_name, atts)
    start_page if q_name == @page_tag && atts.getValue('class') == @page_class
  end

  def endElement(uri, local_name, q_name)
    end_page if q_name == @page_tag
  end

  # Append only the slice ch[start, length]; the rest of the buffer
  # is not part of this event.
  def characters(ch, start, length)
    if length > 0
      builder = StringBuilder.new(length)
      builder.append(ch, start, length)
      @page_map[@page_number] << builder.to_s if @page_number > 0
    end
  end

  def start_page
    @page_number = @page_number + 1
    @page_map[@page_number] = String.new
  end

  def end_page
    # Nothing to do: the next start_page opens the next entry.
  end
end

Here is its usage:

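java_import 'org.apache.tika.parser.AutoDetectParser'
java_import 'org.apache.tika.parser.ParseContext'

# input_stream is a java.io.InputStream, @metadata_java a Tika Metadata object
parser = AutoDetectParser.new
handler = PageContentHandler.new
parser.parse(input_stream, handler, @metadata_java, ParseContext.new)
puts handler.page_map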
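
For anyone who wants to check the page-splitting logic without Tika on the classpath, here is a rough plain-Ruby sketch of the same idea. PageCollector and its method names are hypothetical stand-ins for the SAX callbacks, not Tika API, and the events are fed by hand instead of coming from a parser. It differs from the handler above in two ways, both of which are my assumptions about what is wanted: it tracks nesting depth so a div inside a page does not close the page early, and it skips text that falls between pages.

```ruby
# Plain-Ruby sketch of the page-splitting logic; no Tika or JRuby needed.
# PageCollector is a hypothetical stand-in for the SAX handler.
class PageCollector
  attr_reader :pages

  def initialize(page_tag = 'div', page_class = 'page')
    @page_tag = page_tag
    @page_class = page_class
    @pages = []  # one string per page, in document order
    @depth = 0   # nesting depth of page_tag elements inside the current page
  end

  # attrs is a Hash of attribute name => value.
  def start_element(name, attrs = {})
    return unless name == @page_tag

    if @depth.zero?
      # Only a top-level <div class="page"> opens a new page.
      if attrs['class'] == @page_class
        @pages << String.new
        @depth = 1
      end
    else
      # A nested div inside a page: count it so that its closing tag
      # does not end the page early.
      @depth += 1
    end
  end

  def end_element(name)
    @depth -= 1 if name == @page_tag && @depth > 0
  end

  # Text is kept only while a page is open.
  def characters(text)
    @pages.last << text if @depth > 0
  end
end

# Hand-written events that imitate what a SAX parser would report.
events = [
  [:start_element, 'body'],
  [:start_element, 'div', { 'class' => 'page' }],
  [:characters, 'Page one, '],
  [:start_element, 'div', { 'class' => 'note' }],
  [:characters, 'with a nested div.'],
  [:end_element, 'div'],
  [:end_element, 'div'],
  [:characters, 'text between pages is skipped'],
  [:start_element, 'div', { 'class' => 'page' }],
  [:characters, 'Page two.'],
  [:end_element, 'div'],
  [:end_element, 'body']
]

collector = PageCollector.new
events.each { |event, *args| collector.public_send(event, *args) }

collector.pages.each_with_index { |text, i| puts "#{i + 1}: #{text}" }
```

The result is an array with one plain-text string per page, which is the shape I was after in my original question.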

I tested it with several different PDF documents and it worked correctly on
all of them. Unfortunately, it does not work well with the docx format. I
checked the XML of a docx file, and it does not have any div with class page.
Instead, the page is identified by a <footer> tag. So for the moment my
solution is to convert every document to PDF and then extract the content
per page.

Regards
Nazar Hussain


On Wed, 15 Jul 2015 at 19:13 Nick Burch wrote:

> On Wed, 15 Jul 2015, Nazar Hussain wrote:
> > @Matt. I am looking for plain text extraction, no css or xpath. I just
> > want to extract text per page. So I would have array of plain text
> > content on which each index have content of a single page.
>
> You won't be able to do it in the plain-text space. You'll need to extract
> as XHTML, split into pages based on the page divs, then down-convert the
> XHTML for each page into plain text
>
> If you have the plain text, then you've lost the page-break information.
> That's only there in the XHTML
>
> Nick
>
