Also, Nazar, are you talking about, e.g., Scrapy-style extractions? If so, Tika has the ContentHandler interface. From Java this is relatively easy to call, but we don't really provide a mechanism from the command line and/or the REST server to run arbitrary extractions. Maybe we should think about adding that, guys?
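
For reference, here is a minimal sketch of what calling Tika with a ContentHandler from Java can look like. The class name, the file-path argument, and the choice of BodyContentHandler are just illustrative; any SAX ContentHandler (including a custom one doing Scrapy-style selective extraction) can be plugged in the same way.

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

public class CustomHandlerExample {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        // Any SAX ContentHandler can go here; BodyContentHandler just
        // collects the plain text (-1 disables its write limit), but a
        // custom handler could do selective extraction instead.
        ContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();
        try (InputStream stream = Files.newInputStream(Paths.get(args[0]))) {
            parser.parse(stream, handler, metadata, new ParseContext());
        }
        System.out.println(metadata);
        System.out.println(handler.toString());
    }
}

Run it with a document path as the first argument and it prints the detected metadata followed by the extracted text.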
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Nick Burch <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, July 15, 2015 at 1:59 AM
To: "[email protected]" <[email protected]>
Subject: Re: Per Page Document Content

>On Wed, 15 Jul 2015, Nazar Hussain wrote:
>> The problem I am facing is with pages. I can extract total pages from
>> document metadata. But I can't find any way to extract content per page
>> from the document.
>
>What file formats is this for? And how are you calling Tika?
>
>If the file format is page-based, eg PDF or PPT, then the html you get
>back should have each page separated, IIRC by a div per page
>
>If the file format isn't a page-based one, and no page information is
>available in the file, then there won't be page information in the HTML as
>Tika isn't able to render the document to spot page breaks.
>
>Nick
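
On Nick's point about the per-page divs: below is a rough, unofficial sketch of capturing per-page text from Java by watching for the <div class="page"> elements that page-based parsers such as the PDF parser emit in their XHTML output. The PerPageHandler class, its main(), and the assumption that each page is wrapped in exactly such a div are illustrative only, not a Tika API.

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class PerPageHandler extends DefaultHandler {
    private final List<String> pages = new ArrayList<>();
    private StringBuilder current;   // text of the page being read, null between pages
    private int divDepth;            // nesting depth of divs inside the current page

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (current == null) {
            // Assumes the parser wraps each page in <div class="page">, per Nick's note.
            if ("div".equals(localName) && "page".equals(atts.getValue("class"))) {
                current = new StringBuilder();
                divDepth = 0;
            }
        } else if ("div".equals(localName)) {
            divDepth++; // a nested div inside the current page
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current != null && "div".equals(localName)) {
            if (divDepth == 0) {
                pages.add(current.toString());
                current = null;
            } else {
                divDepth--;
            }
        }
    }

    public List<String> getPages() {
        return pages;
    }

    public static void main(String[] args) throws Exception {
        PerPageHandler handler = new PerPageHandler();
        try (InputStream stream = Files.newInputStream(Paths.get(args[0]))) {
            new AutoDetectParser().parse(stream, handler, new Metadata(), new ParseContext());
        }
        int i = 1;
        for (String page : handler.getPages()) {
            System.out.println("=== Page " + (i++) + " ===");
            System.out.println(page.trim());
        }
    }
}

As Nick says, this only works for formats where the parser actually emits per-page markup; for non-page-based formats the handler will simply collect no pages.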
