On Mon, 12 Sep 2016, Sergey Beryozkin wrote:
By the way, I've found out AutoDetectParser may not work if the (pdf) stream
is an attachment stream which may not support a mark.
Simplest would probably be just to wrap it in a TikaInputStream, which
would handle any buffering/marking as needed.
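
To make the mark/reset issue above concrete, here is a minimal sketch using only JDK classes. It assumes the problem is a stream whose markSupported() returns false, and it uses BufferedInputStream as a stand-in for the wrapper; in Tika itself the equivalent step would be TikaInputStream.get(stream), as suggested above. The class name MarkSupportSketch and the helpers noMarkStream, ensureMarkSupported and peek are hypothetical names made up for this illustration, not part of Tika.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class MarkSupportSketch {

    /** Hypothetical stand-in for an attachment stream that cannot mark/reset. */
    static InputStream noMarkStream(byte[] data) {
        return new FilterInputStream(new ByteArrayInputStream(data)) {
            @Override public boolean markSupported() { return false; }
            @Override public void mark(int readLimit) { /* ignored */ }
            @Override public void reset() throws IOException {
                throw new IOException("mark/reset not supported");
            }
        };
    }

    /** Wrap the stream in a buffering stream only if it cannot mark/reset itself. */
    static InputStream ensureMarkSupported(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    /** Read up to n leading bytes, as a type detector might, then rewind. */
    static String peek(InputStream in, int n) throws IOException {
        in.mark(n);
        byte[] head = new byte[n];
        int total = 0;
        while (total < n) {
            int r = in.read(head, total, n - total);
            if (r < 0) {
                break; // end of stream before n bytes were available
            }
            total += r;
        }
        in.reset(); // the stream is positioned at the start again
        return new String(head, 0, total, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) throws IOException {
        byte[] pdf = "%PDF-1.4 rest of document".getBytes(StandardCharsets.US_ASCII);
        InputStream safe = ensureMarkSupported(noMarkStream(pdf));
        System.out.println(peek(safe, 5)); // header bytes used for detection
        System.out.println(peek(safe, 8)); // still readable from the beginning
    }
}
```

The point of the sketch is only that detection needs to read the first bytes and then rewind, so the wrapper has to be applied before the stream is handed to the parser.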
Hi Tim
This is very helpful, thanks.
I'll experiment with the code below.
By the way, I've found out AutoDetectParser may not work if the (pdf)
stream is an attachment stream which may not support a mark.
I've been wondering, would it make sense to pass a MediaType identifying
the data format
Thanks. It looked like a very useful feature which would be worth
reinstating if the security vulnerability could be patched.
-----Original Message-----
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: 12 September 2016 14:42
To: user@tika.apache.org
Subject: RE: Query on correct use
I think fileUrl only existed in 1.9. We removed it because it introduced a
security vulnerability
(http://www.openwall.com/lists/oss-security/2015/08/13/5).
I just updated the wiki. Sorry!
-----Original Message-----
From: Sergey Beryozkin [mailto:sberyoz...@gmail.com]
Sent: Monday, Septem
Hi Sergey,
> Is this code good enough to get all the content (and metadata) out of a
> 'simple' PDF ?
Yes, but...
> For example, Tim has mentioned that it is possible to handle embedded PDF
> attachments - I don't even know what they are, to me every PDF is just a text
> when I look at it :-
Hi, can you do me a favor and paste the -v output?
-H identifies a request header, so I wonder if it should be
curl -i fileUrl:http://www.bbc.co.uk/news -H "Accept: text/plain" -X PUT
http://localhost:9998/tika
?
(though I've never used this option)
Thanks, Sergey
On 11/09/16 09:48, John Doug