Re: How to parse PDF files effectively with Tika

2016-09-12 Thread Nick Burch
On Mon, 12 Sep 2016, Sergey Beryozkin wrote:
> By the way, I've found out AutoDetectParser may not work if the (pdf) stream is an attachment stream which may not support a mark.

Simplest would probably be just to wrap it in a TikaInputStream, which would handle any buffering/marking as needed
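A minimal sketch of why mark support matters for detection, and how a buffering wrapper restores it. To stay runnable without Tika on the classpath it uses java.io.BufferedInputStream as a stand-in for TikaInputStream; with Tika available, the wrapping step would presumably be TikaInputStream.get(stream) instead. The class MarkSupportSketch and the helpers withoutMark and looksLikePdf are hypothetical names made up for this illustration, not part of Tika.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class MarkSupportSketch {

    /** Hypothetical stand-in for an attachment stream: hides any mark/reset support. */
    static InputStream withoutMark(InputStream in) {
        return new FilterInputStream(in) {
            @Override public boolean markSupported() { return false; }
            @Override public synchronized void mark(int readLimit) { /* ignored */ }
            @Override public synchronized void reset() throws IOException {
                throw new IOException("mark/reset not supported");
            }
        };
    }

    /** Peeks at the leading bytes the way a type detector would, then rewinds the stream. */
    static boolean looksLikePdf(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        byte[] head = new byte[magic.length];
        in.mark(magic.length);
        try {
            int total = 0;
            while (total < head.length) {
                int n = in.read(head, total, head.length - total);
                if (n < 0) break; // stream shorter than the magic prefix
                total += n;
            }
            return total == magic.length && Arrays.equals(head, magic);
        } finally {
            in.reset(); // leave the stream positioned at its start for the parser
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "%PDF-1.4 example".getBytes(StandardCharsets.US_ASCII);
        InputStream attachment = withoutMark(new ByteArrayInputStream(data));

        // Wrapping adds the buffering needed for mark/reset.
        // Assumption: with Tika, TikaInputStream.get(attachment) would play this role.
        InputStream wrapped = new BufferedInputStream(attachment);

        System.out.println("raw supports mark:     " + attachment.markSupported());
        System.out.println("wrapped supports mark: " + wrapped.markSupported());
        System.out.println("looks like PDF:        " + looksLikePdf(wrapped));
    }
}
```

After the peek, the wrapped stream is back at its first byte, so a parser reading from it afterwards still sees the whole document.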

Re: How to parse PDF files effectively with Tika

2016-09-12 Thread Sergey Beryozkin
Hi Tim,

This is very helpful, thanks. I'll experiment with the code below. By the way, I've found out that AutoDetectParser may not work if the (PDF) stream is an attachment stream, which may not support a mark. I've been wondering, would it make sense to pass a MediaType identifying the data format

RE: Query on correct use of 'fileUrl' in TikaJAXRS Server to extract document at remote url - my request is not working

2016-09-12 Thread John Dougrez-Lewis
Thanks. It looked like a very useful feature which would be worth reinstating if the security vulnerability could be patched.

-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: 12 September 2016 14:42
To: user@tika.apache.org
Subject: RE: Query on correct use

RE: Query on correct use of 'fileUrl' in TikaJAXRS Server to extract document at remote url - my request is not working

2016-09-12 Thread Allison, Timothy B.
I think fileUrl only existed in 1.9. We removed it because it introduced a security vulnerability (http://www.openwall.com/lists/oss-security/2015/08/13/5). I just updated the wiki. Sorry!

-Original Message-
From: Sergey Beryozkin [mailto:sberyoz...@gmail.com]
Sent: Monday, Septem

RE: How to parse PDF files effectively with Tika

2016-09-12 Thread Allison, Timothy B.
Hi Sergey,

> Is this code good enough to get all the content (and metadata) out of a 'simple' PDF ?

Yes, but...

> For example, Tim has mentioned that it is possible to handle embedded PDF attachments - I don't even know what they are, to me every PDF is just a text when I look at it :-

Re: Query on correct use of 'fileUrl' in TikaJAXRS Server to extract document at remote url - my request is not working

2016-09-12 Thread Sergey Beryozkin
Hi,

Can you do me a favor and paste the -v output? -H identifies a request header, so I wonder if it should be

curl -i fileUrl:http://www.bbc.co.uk/news -H "Accept: text/plain" -X PUT http://localhost:9998/tika

? (though I've never used this option)

Thanks, Sergey

On 11/09/16 09:48, John Doug