RE: Can I instruct the Tika Entity Processor to skip the first page using the DIH?

2015-07-09 Thread Allison, Timothy B.
Concur on both points. You can also use PDFBox's app ExtractText with -startPage and -endPage parameters: https://pdfbox.apache.org/1.8/commandline.html#extractText
-----Original Message-----
From: Charlie Hull [mailto:char...@flax.co.uk]
Sent: Thursday, July 09, 2015 3:55 AM
To:
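For illustration only, the ExtractText suggestion above could be scripted as a pre-processing step before the files reach the DIH. This is a minimal sketch, not something posted in the thread: the jar file name, the file paths, and the helper function names are all hypothetical. Only the ExtractText tool and its -startPage / -endPage options come from the PDFBox 1.8 command-line documentation linked above.

```python
import subprocess


def extract_text_command(pdf_path, txt_path, pdfbox_jar, start_page=2, end_page=None):
    """Build the PDFBox ExtractText command line that skips leading pages.

    start_page is 1-based, so the default of 2 drops the first page.
    pdfbox_jar is a hypothetical path to a locally downloaded pdfbox-app jar.
    """
    cmd = ["java", "-jar", str(pdfbox_jar), "ExtractText",
           "-startPage", str(start_page)]
    if end_page is not None:
        cmd += ["-endPage", str(end_page)]
    # Input PDF first, then the text file that will receive the extracted text.
    cmd += [str(pdf_path), str(txt_path)]
    return cmd


def extract_without_first_page(pdf_path, txt_path, pdfbox_jar):
    """Run ExtractText; raises CalledProcessError if java exits non-zero."""
    subprocess.run(extract_text_command(pdf_path, txt_path, pdfbox_jar), check=True)


if __name__ == "__main__":
    # Hypothetical file names, printed rather than executed.
    print(" ".join(extract_text_command("report.pdf", "report.txt", "pdfbox-app.jar")))
```

The resulting plain-text files could then be indexed in place of the original PDFs, which sidesteps the missing page-range support in Tika mentioned elsewhere in this thread.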

Re: Can I instruct the Tika Entity Processor to skip the first page using the DIH?

2015-07-09 Thread Charlie Hull
On 08/07/2015 20:39, Allison, Timothy B. wrote:
> Unfortunately, no. We can't even do that now with straight Tika. I imagine this is for PDF files? If you'd like to add this as a feature, please submit a ticket over on Tika.
Another alternative is to pre-process the PDF files to remove the

Can I instruct the Tika Entity Processor to skip the first page using the DIH?

2015-07-08 Thread Paden
Hello, I'm using the DIH to import some files from one of my local directories. However, every single one of these files has the same first page. So I want to skip that first page in order to optimize search. Can this be accomplished by an instruction within the dataimporthandler or, if not, how

RE: Can I instruct the Tika Entity Processor to skip the first page using the DIH?

2015-07-08 Thread Allison, Timothy B.
Unfortunately, no. We can't even do that now with straight Tika. I imagine this is for PDF files? If you'd like to add this as a feature, please submit a ticket over on Tika.
-----Original Message-----
From: Paden [mailto:rumsey...@gmail.com]
Sent: Wednesday, July 08, 2015 12:14 PM
To: