@Matt. I am looking for plain-text extraction, no CSS or XPath. I just want
to extract the text per page, so that I end up with an array of plain-text
content in which each index holds the content of a single page.
@Nick. I have made progress with the links you shared. My working handler
class is now:
class PageContentHandler < ToXMLContentHandler
  attr_accessor :page_tag
  attr_accessor :page_number
  attr_accessor :page_class

  def initialize
    @page_number = 0
    @page_tag = 'div'
    @page_class = 'page'
  end

  # SAX callback: a <div class="page"> marks the start of a new page.
  def startElement(uri, local_name, q_name, atts)
    start_page if @page_tag == q_name && atts.getValue('class') == @page_class
  end

  # SAX callback: fires for every closing <div>, not only for page divs.
  def endElement(uri, local_name, q_name)
    end_page if @page_tag == q_name
  end

  def start_page
    @page_number += 1
  end

  # Placeholder: nothing is done at the end of a page yet.
  def end_page
  end
end
And the code I am using is:
parser = AutoDetectParser.new
handler = PageContentHandler.new
parser.parse(input_stream, handler, @metadata_java, ParseContext.new)
If I then call
puts handler.page_number
it prints the correct number of pages in the document.
Now, how can I get at the text inside each individual page using that
content handler?
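
One direction I could imagine, as a minimal sketch only: the SAX
ContentHandler interface also has a characters(ch, start, length) callback,
so the text seen between the start and end of a page div could be buffered
and pushed onto an array. PageTextCollector and FakeAtts below are
hypothetical names, not part of Tika; the sketch is plain Ruby so it runs
without the Tika jars, and it assumes each page really is wrapped in
<div class="page">. Under JRuby the same three methods would presumably go
on the handler subclass, and ch would be a Java char[] rather than a Ruby
String.

```ruby
# Hypothetical helper, not part of Tika: collects text per page from
# SAX-style callbacks. Method names mirror org.xml.sax.ContentHandler.
class PageTextCollector
  attr_reader :pages

  def initialize(page_tag = 'div', page_class = 'page')
    @page_tag = page_tag
    @page_class = page_class
    @pages = []     # one string per finished page
    @current = nil  # buffer for the page being read; nil outside a page
    @depth = 0      # nesting depth of page_tag elements within the page
  end

  def startElement(_uri, _local_name, q_name, atts)
    return unless q_name == @page_tag
    if @current
      @depth += 1                     # a nested div inside the page
    elsif atts.getValue('class') == @page_class
      @current = ''.dup               # start a fresh, mutable buffer
      @depth = 1
    end
  end

  def characters(ch, start, length)
    return unless @current
    # Assumption: under JRuby, ch is a Java char[] and may need
    # java.lang.String.new(ch, start, length).to_s instead of slicing.
    chunk = ch[start, length]
    @current << (chunk.is_a?(Array) ? chunk.join : chunk)
  end

  def endElement(_uri, _local_name, q_name)
    return unless @current && q_name == @page_tag
    @depth -= 1
    return unless @depth.zero?        # only the outer page div ends a page
    @pages << @current.strip
    @current = nil
  end
end

# Stand-in for org.xml.sax.Attributes so the sketch runs without Tika.
FakeAtts = Struct.new(:klass) do
  def getValue(name)
    name == 'class' ? klass : nil
  end
end

# Simulate the events for two pages, each containing a nested div.
collector = PageTextCollector.new
['First page text', 'Second page text'].each do |text|
  collector.startElement('', 'div', 'div', FakeAtts.new('page'))
  collector.startElement('', 'div', 'div', FakeAtts.new('annotation'))
  collector.endElement('', 'div', 'div')
  collector.characters(text, 0, text.length)
  collector.endElement('', 'div', 'div')
end
puts collector.pages.inspect
```

The depth counter is there because endElement fires for every closing div,
so without it a nested div would end the page too early. After the
simulated run, collector.pages holds one string per page.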
Regards
Nazar Hussain
On Wed, 15 Jul 2015 at 18:19 Mattmann, Chris A (3980) <
[email protected]> wrote:
> Also, Nazar, are you talking about e.g., Scrapy style extractions?
> If so, Tika has the Content Handler interface. From Java, this is
> relatively easy to call, but we don’t really provide a mechanism
> from the command line and/or REST server to call arbitrary extractions.
> Maybe we should think about doing that, guys?
>
> Chris Mattmann, Ph.D.
>
> -----Original Message-----
> From: Nick Burch <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Wednesday, July 15, 2015 at 1:59 AM
> To: "[email protected]" <[email protected]>
> Subject: Re: Per Page Document Content
>
> >On Wed, 15 Jul 2015, Nazar Hussain wrote:
> >> The problem I am facing is with pages. I can extract total pages from
> >> document metadata. But I can't find any way to extract content per page
> >> from the document.
> >
> >What file formats is this for? And how are you calling Tika?
> >
> >If the file format is page-based, eg PDF or PPT, then the html you get
> >back should have each page separated, IIRC by a div per page
> >
> >If the file format isn't a page-based one, and no page information is
> >available in the file, then there won't be page information in the HTML
> >as Tika isn't able to render the document to spot page breaks.
> >
> >Nick
>
>