Concur on both points. You can also use PDFBox's ExtractText command-line
app with the -startPage and -endPage parameters:
https://pdfbox.apache.org/1.8/commandline.html#extractText
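
For what it's worth, here is a rough sketch of how that pre-processing step might be scripted in Python before the files are handed to the DIH. It only assembles and runs the ExtractText command line with -startPage 2 so the shared first page is skipped. The jar name (pdfbox-app.jar), the helper names, and the file paths are placeholders of mine, not anything from the PDFBox docs, so adjust them to your install.

```python
import subprocess


def build_extract_command(pdf_path, txt_path, start_page=2, end_page=None,
                          pdfbox_jar="pdfbox-app.jar"):
    """Build the argv list for PDFBox's ExtractText command-line app.

    pdfbox_jar is a hypothetical path; point it at your actual pdfbox-app jar.
    Page numbers are one-based, and end_page (if given) is inclusive.
    """
    if start_page < 1:
        raise ValueError("start_page is one-based and must be >= 1")
    cmd = ["java", "-jar", str(pdfbox_jar), "ExtractText",
           "-startPage", str(start_page)]
    if end_page is not None:
        if end_page < start_page:
            raise ValueError("end_page must be >= start_page")
        cmd += ["-endPage", str(end_page)]
    # Input PDF first, then the text file that will receive the output.
    cmd += [str(pdf_path), str(txt_path)]
    return cmd


def extract_without_first_page(pdf_path, txt_path, pdfbox_jar="pdfbox-app.jar"):
    """Run ExtractText starting at page 2 (needs Java and the PDFBox app jar)."""
    cmd = build_extract_command(pdf_path, txt_path, start_page=2,
                                pdfbox_jar=pdfbox_jar)
    subprocess.run(cmd, check=True)
```

Calling extract_without_first_page("report.pdf", "report.txt") would then write the text of pages 2 onward to report.txt, which you could index in place of the original PDF. Note that this gives you plain text only, so any metadata Tika would otherwise pull from the PDF has to be handled separately.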
-----Original Message-----
From: Charlie Hull [mailto:char...@flax.co.uk]
Sent: Thursday, July 09, 2015 3:55 AM
To:
On 08/07/2015 20:39, Allison, Timothy B. wrote:
Unfortunately, no. We can't even do that now with straight Tika. I
imagine this is for PDF files? If you'd like to add this as a
feature, please submit a ticket over on Tika.
Another alternative is to pre-process the PDF files to remove the
Hello, I'm using the DIH to import some files from one of my local
directories. However, every single one of these files has the same first
page. So I want to skip that first page in order to optimize search.
Can this be accomplished by an instruction within the dataimporthandler or,
if not, how
Unfortunately, no. We can't even do that now with straight Tika. I imagine
this is for PDF files? If you'd like to add this as a feature, please submit a
ticket over on Tika.
-----Original Message-----
From: Paden [mailto:rumsey...@gmail.com]
Sent: Wednesday, July 08, 2015 12:14 PM
To: