Nick,
I finally solved my problem. Here is my content handler class:
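# Assumes JRuby with the Tika jars already on the classpath.
require 'java'
java_import 'java.lang.StringBuilder'
java_import 'org.apache.tika.sax.ToXMLContentHandler'

# Collects the text found inside each <div class="page"> into page_map,
# keyed by a 1-based page number.
class PageContentHandler < ToXMLContentHandler
  attr_accessor :page_tag
  attr_accessor :page_number
  attr_accessor :page_class
  attr_accessor :page_map

  def initialize
    @page_number = 0
    @page_tag = 'div'
    @page_class = 'page'
    @page_map = Hash.new
  end

  # A new page starts at every <div class="page">.
  def startElement(uri, local_name, q_name, atts)
    start_page() if @page_tag == q_name && atts.getValue('class') == @page_class
  end

  def endElement(uri, local_name, q_name)
    end_page() if @page_tag == q_name
  end

  # SAX may hand over a shared buffer, so only the slice
  # ch[start, length] belongs to this event.
  def characters(ch, start, length)
    if length > 0
      builder = StringBuilder.new(length)
      builder.append(ch, start, length)
      @page_map[@page_number] << builder.to_s if @page_number > 0
    end
  end

  def start_page
    @page_number = @page_number + 1
    @page_map[@page_number] = String.new
  end

  # Nothing to do when a page closes; kept as a hook.
  def end_page
    return
  end
end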
Here is its usage:
parser = AutoDetectParser.new
handler = PageContentHandler.new
parser.parse(input_stream, handler, @metadata_java, ParseContext.new)
puts handler.page_map
I tested it with several PDF documents and it works perfectly.
Unfortunately it does not work well with the docx format. I checked the XML
format of a docx file, and it does not have any div with class page. Instead
the page is identified with a <footer> tag. So for the moment my solution is
to convert every document to PDF first and then extract the content per page.
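One caveat about the handler above: end_page does nothing, so any text that
shows up after a page div has closed is still appended to the last page. A
possible refinement is to track how deeply the page tag is nested and only
collect text while inside a page. Below is a minimal sketch in plain Ruby, not
the Tika API: PageSplitter and its start_element / end_element / characters
methods are hypothetical stand-ins for the SAX callbacks, fed by hand here so
the logic can be run without JRuby.

```ruby
# Hypothetical stand-in for the SAX callbacks; not part of Tika.
class PageSplitter
  attr_reader :pages

  def initialize(page_tag = 'div', page_class = 'page')
    @page_tag = page_tag
    @page_class = page_class
    @pages = []   # one string per page, in document order
    @depth = 0    # 0 = outside any page, 1 = in a page div, 2+ = nested divs
  end

  def start_element(q_name, attrs = {})
    return unless q_name == @page_tag
    if @depth.zero?
      # Only a div carrying the page class opens a new page.
      if attrs['class'] == @page_class
        @pages << String.new
        @depth = 1
      end
    else
      # A div nested inside the current page: remember it so that its
      # closing tag is not mistaken for the end of the page.
      @depth += 1
    end
  end

  def end_element(q_name)
    @depth -= 1 if q_name == @page_tag && @depth > 0
  end

  def characters(text)
    # Text outside of any page div is dropped.
    @pages.last << text if @depth > 0
  end
end

# Feed the events by hand, in the order a SAX parser would emit them.
splitter = PageSplitter.new
splitter.start_element('div', 'class' => 'page')
splitter.characters('First page')
splitter.start_element('div', 'class' => 'annotation')
splitter.characters(' note')
splitter.end_element('div')
splitter.end_element('div')
splitter.characters('stray text')
splitter.start_element('div', 'class' => 'page')
splitter.characters('Second page')
splitter.end_element('div')
p splitter.pages
```

In the real handler the same counter could live in startElement and
endElement, with characters checking it instead of @page_number > 0.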
Regards
Nazar Hussain
On Wed, 15 Jul 2015 at 19:13 Nick Burch wrote:
> On Wed, 15 Jul 2015, Nazar Hussain wrote:
> > @Matt. I am looking for plain text extraction, no css or xpath. I just
> > want to extract text per page. So I would have array of plain text
> > content on which each index have content of a single page.
>
> You won't be able to do it in the plain-text space. You'll need to extract
> as XHTML, split into pages based on the page divs, then down-convert the
> XHTML for each page into plain text
>
> If you have the plain text, then you've lost the page-break information.
> That's only there in the XHTML
>
> Nick
>