Re: How to parse PDF files effectively with Tika

2016-09-15 Thread Sergey Beryozkin
.@gmail.com]
Sent: Friday, September 9, 2016 10:06 AM
To: user@tika.apache.org
Subject: How to parse PDF files effectively with Tika

Hi All,

While I've experimented with writing some simple demo code which creates a Tika PDFParser (and a few other parsers) and provides a ToTextContentHandler fo

Re: How to parse PDF files effectively with Tika

2016-09-12 Thread Nick Burch
On Mon, 12 Sep 2016, Sergey Beryozkin wrote:
> By the way, I've found out AutoDetectParser may not work if the (pdf) stream is an attachment stream which may not support a mark.

Simplest would probably be just to wrap it in a TikaInputStream, which would handle any buffering/marking as needed
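
To illustrate the mark/reset issue discussed here, below is a minimal sketch using only the JDK. It is not Tika code: `NoMarkStream`, `ensureMarkSupported` and `peekHeader` are hypothetical names invented for this example, and `BufferedInputStream` stands in for the buffering that a wrapper such as `TikaInputStream.get(stream)` would be expected to provide. The idea is that type detection needs to peek at the first bytes and then rewind, which only works if the stream supports a mark.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class MarkSupportSketch {

    /** Hypothetical stand-in for an attachment stream that refuses mark/reset. */
    static final class NoMarkStream extends FilterInputStream {
        NoMarkStream(InputStream in) {
            super(in);
        }

        @Override
        public boolean markSupported() {
            return false;
        }

        @Override
        public synchronized void mark(int readLimit) {
            // Intentionally a no-op: this stream cannot remember a position.
        }

        @Override
        public synchronized void reset() throws IOException {
            throw new IOException("mark/reset not supported");
        }
    }

    /** Returns the stream unchanged if it supports mark, otherwise adds a buffer. */
    static InputStream ensureMarkSupported(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    /** Peeks at up to n leading bytes, as a type detector would, then rewinds. */
    static String peekHeader(InputStream in, int n) throws IOException {
        in.mark(n);
        byte[] head = new byte[n];
        int read = in.readNBytes(head, 0, n);
        in.reset();
        return new String(head, 0, read, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "%PDF-1.4 body".getBytes(StandardCharsets.US_ASCII);
        InputStream raw = new NoMarkStream(new ByteArrayInputStream(data));

        // Wrapping makes the peek-and-rewind step safe for the raw stream.
        InputStream safe = ensureMarkSupported(raw);
        System.out.println(peekHeader(safe, 5));

        // After the rewind, the full content is still available to a parser.
        System.out.println(new String(safe.readAllBytes(), StandardCharsets.US_ASCII));
    }
}
```

Calling `peekHeader` directly on the unwrapped `NoMarkStream` would fail at `reset()`, which is roughly the situation described above for an attachment stream passed straight to a detecting parser.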

Re: How to parse PDF files effectively with Tika

2016-09-12 Thread Sergey Beryozkin
return wrapper.getMetadata();

-Original Message-
From: Sergey Beryozkin [mailto:sberyoz...@gmail.com]
Sent: Friday, September 9, 2016 10:06 AM
To: user@tika.apache.org
Subject: How to parse PDF files effectively with Tika

Hi All,

While I've experimented with writing some simple demo code which

RE: How to parse PDF files effectively with Tika

2016-09-12 Thread Allison, Timothy B.
efaultHandler(), new Metadata(), context); } return wrapper.getMetadata();

-Original Message-
From: Sergey Beryozkin [mailto:sberyoz...@gmail.com]
Sent: Friday, September 9, 2016 10:06 AM
To: user@tika.apache.org
Subject: How to parse PDF files effectively with Tika

Hi All,

While I've

How to parse PDF files effectively with Tika

2016-09-09 Thread Sergey Beryozkin
Hi All,

While I've experimented with writing some simple demo code which creates a Tika PDFParser (and a few other parsers) and provides a ToTextContentHandler for it to return the content, I'm realizing I'm not quite sure what the best strategy is. For example, Tim has mentioned that it is