Also, Nazar, are you talking about, e.g., Scrapy-style extractions?
If so, Tika has the ContentHandler interface. From Java, this is
relatively easy to call, but we don’t really provide a mechanism
from the command line and/or REST server to run arbitrary extractions.
Maybe we should think about adding that, guys?
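
For reference, here is a rough sketch of what calling Tika from Java with a
ContentHandler looks like. The file name and handler choice are just
placeholders; BodyContentHandler simply collects the body text, but any SAX
ContentHandler can be plugged in to do custom, streaming extraction:

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class TikaHandlerSketch {
    public static void main(String[] args) throws Exception {
        // BodyContentHandler collects the plain-text body; -1 disables the
        // default write limit. Any other SAX ContentHandler could be passed
        // here instead.
        BodyContentHandler handler = new BodyContentHandler(-1);
        AutoDetectParser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();

        try (InputStream stream = Files.newInputStream(Paths.get("example.pdf"))) {
            parser.parse(stream, handler, metadata, new ParseContext());
        }
        System.out.println(handler.toString());
    }
}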

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory, Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Nick Burch <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, July 15, 2015 at 1:59 AM
To: "[email protected]" <[email protected]>
Subject: Re: Per Page Document Content

>On Wed, 15 Jul 2015, Nazar Hussain wrote:
>> The problem I am facing is with pages. I can extract total pages from
>> document metadata. But I can't find any way to extract content per page
>> from the document.
>
>What file formats is this for? And how are you calling Tika?
>
>If the file format is page-based, e.g. PDF or PPT, then the HTML you get
>back should have each page separated, IIRC by a div per page.
>
>If the file format isn't a page-based one, and no page information is
>available in the file, then there won't be page information in the HTML,
>as Tika isn't able to render the document to spot page breaks.
>
>Nick
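
Following up on Nick's point about page-based formats: for PDFs, Tika wraps
each page's content in <div class="page"> in the XHTML output, so one way to
get per-page text from Java is a small custom ContentHandler that buffers
characters between those divs. A rough, untested sketch (the div/class names
reflect the PDF parser's output and may differ for other formats):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class PerPageHandler extends DefaultHandler {
    private final List<String> pages = new ArrayList<>();
    private StringBuilder current = null;
    private int divDepth = 0; // tracks nested divs inside the current page div

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("div".equals(localName)) {
            if (current == null && "page".equals(atts.getValue("class"))) {
                // Start of a new page div emitted by the PDF parser
                current = new StringBuilder();
                divDepth = 1;
            } else if (current != null) {
                divDepth++;
            }
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("div".equals(localName) && current != null && --divDepth == 0) {
            pages.add(current.toString());
            current = null;
        }
    }

    public List<String> getPages() { return pages; }

    public static void main(String[] args) throws Exception {
        PerPageHandler handler = new PerPageHandler();
        try (InputStream stream = Files.newInputStream(Paths.get("example.pdf"))) {
            new AutoDetectParser().parse(stream, handler, new Metadata(), new ParseContext());
        }
        System.out.println(handler.getPages().size() + " pages extracted");
    }
}

(The tika-app CLI's -x/--xml option emits the same XHTML, if you'd rather
post-process the per-page divs outside Java.)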
