Also, Nazar, are you talking about, e.g., Scrapy-style extractions? If so, Tika has the ContentHandler interface. From Java this is relatively easy to call, but we don't really provide a mechanism from the command line and/or the REST server to run arbitrary extractions. Maybe we should think about adding that, guys?
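
For reference, here is a minimal sketch of what calling Tika with a ContentHandler from Java can look like. The class name, the file-path argument, and the choice of BodyContentHandler are just illustrative; any SAX ContentHandler (including a custom one doing Scrapy-style selective extraction) can be plugged in the same way.

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

public class CustomHandlerExample {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        // Any SAX ContentHandler can go here; BodyContentHandler just
        // collects the plain text (-1 disables its write limit), but a
        // custom handler could do selective extraction instead.
        ContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();
        try (InputStream stream = Files.newInputStream(Paths.get(args[0]))) {
            parser.parse(stream, handler, metadata, new ParseContext());
        }
        System.out.println(metadata);
        System.out.println(handler.toString());
    }
}

Run it with a document path as the first argument and it prints the detected metadata followed by the extracted text.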
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Nick Burch <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, July 15, 2015 at 1:59 AM
To: "[email protected]" <[email protected]>
Subject: Re: Per Page Document Content

>On Wed, 15 Jul 2015, Nazar Hussain wrote:
>> The problem I am facing is with pages. I can extract total pages from
>> document metadata. But I can't find any way to extract content per page
>> from the document.
>
>What file formats is this for? And how are you calling Tika?
>
>If the file format is page-based, eg PDF or PPT, then the html you get
>back should have each page separated, IIRC by a div per page
>
>If the file format isn't a page-based one, and no page information is
>available in the file, then there won't be page information in the HTML as
>Tika isn't able to render the document to spot page breaks.
>
>Nick
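
On Nick's point about the per-page divs: below is a rough, unofficial sketch of capturing per-page text from Java by watching for the <div class="page"> elements that page-based parsers such as the PDF parser emit in their XHTML output. The PerPageHandler class, its main(), and the assumption that each page is wrapped in exactly such a div are illustrative only, not a Tika API.

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class PerPageHandler extends DefaultHandler {
    private final List<String> pages = new ArrayList<>();
    private StringBuilder current;   // text of the page being read, null between pages
    private int divDepth;            // nesting depth of divs inside the current page

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (current == null) {
            // Assumes the parser wraps each page in <div class="page">, per Nick's note.
            if ("div".equals(localName) && "page".equals(atts.getValue("class"))) {
                current = new StringBuilder();
                divDepth = 0;
            }
        } else if ("div".equals(localName)) {
            divDepth++; // a nested div inside the current page
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current != null && "div".equals(localName)) {
            if (divDepth == 0) {
                pages.add(current.toString());
                current = null;
            } else {
                divDepth--;
            }
        }
    }

    public List<String> getPages() {
        return pages;
    }

    public static void main(String[] args) throws Exception {
        PerPageHandler handler = new PerPageHandler();
        try (InputStream stream = Files.newInputStream(Paths.get(args[0]))) {
            new AutoDetectParser().parse(stream, handler, new Metadata(), new ParseContext());
        }
        int i = 1;
        for (String page : handler.getPages()) {
            System.out.println("=== Page " + (i++) + " ===");
            System.out.println(page.trim());
        }
    }
}

As Nick says, this only works for formats where the parser actually emits per-page markup; for non-page-based formats the handler will simply collect no pages.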
