Re: Tika 2 parsers

2017-10-26 Thread Gethin James
The use case is really when embedding Tika and transitive dependencies. I prefer the Tika 2 modular approach as it pulls in fewer jars; however, I don't have so much control over my existing version of PDFBox. I will explore using Tika Server! On 25 October 2017 at 17:44, Allison, Timothy B.

Re: Not-yet-broken breaking changes for Tika 2?

2017-10-26 Thread Nick Burch
On Thu, 26 Oct 2017, Chris Mattmann wrote: Why don’t we just store N copies of the stream, and parse it twice? I'm not sure that's the challenge though? Using TikaInputStream we can buffer to a temp file if needed to re-read the input. Of course that’s the ugly way, but currently the way
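The idea of buffering to a temp file so the input can be re-read might look roughly like the sketch below. It uses only java.nio and is not Tika's TikaInputStream implementation; the class name RereadableInput and its methods are hypothetical, introduced only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper (not a Tika class): spools a one-shot stream to a temp file. */
final class RereadableInput implements AutoCloseable {
    private final Path tempFile;

    RereadableInput(InputStream in) throws IOException {
        Path file = Files.createTempFile("reread-", ".tmp");
        try {
            // Consume the original stream exactly once, writing it to disk.
            Files.copy(in, file, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            Files.deleteIfExists(file);
            throw e;
        }
        this.tempFile = file;
    }

    /** Opens a fresh stream over the buffered bytes; may be called any number of times. */
    InputStream open() throws IOException {
        return Files.newInputStream(tempFile);
    }

    @Override
    public void close() throws IOException {
        Files.deleteIfExists(tempFile);
    }

    public static void main(String[] args) throws IOException {
        InputStream once = new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));
        try (RereadableInput input = new RereadableInput(once)) {
            // Two independent passes over the same content, e.g. one per parser.
            for (int pass = 1; pass <= 2; pass++) {
                try (InputStream in = input.open()) {
                    String text = new String(in.readAllBytes(), StandardCharsets.UTF_8);
                    System.out.println("pass " + pass + ": " + text);
                }
            }
        }
    }
}
```

Each call to open() starts again from the first byte, so a second parser could read the same content without holding N copies in memory. The sketch assumes Java 9 or later for InputStream.readAllBytes.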

Re: Not-yet-broken breaking changes for Tika 2?

2017-10-26 Thread Chris Mattmann
Thanks Nick. My general approach to conflicting metadata is simply to define precedence orders. For example here is one documented from OODT: https://cwiki.apache.org/confluence/display/OODT/Understanding+CAS-PGE+Metadata+Precendence We can do similar things with Tika, e.g.,

Re: Not-yet-broken breaking changes for Tika 2?

2017-10-26 Thread Nick Burch
On Thu, 26 Oct 2017, Chris Mattmann wrote: My general approach to conflicting metadata is simply to define precedence orders. For example here is one documented from OODT: https://cwiki.apache.org/confluence/display/OODT/Understanding+CAS-PGE+Metadata+Precendence We can do similar things

[jira] [Created] (TIKA-2483) Using PackageParser in ForkParser causes NPE

2017-10-26 Thread TzeKai Lee (JIRA)
TzeKai Lee created TIKA-2483: Summary: Using PackageParser in ForkParser causes NPE Key: TIKA-2483 URL: https://issues.apache.org/jira/browse/TIKA-2483 Project: Tika Issue Type: Bug

Not-yet-broken breaking changes for Tika 2?

2017-10-26 Thread Nick Burch
Hi All Based on the plan on the wiki, we still have a major breaking change or two planned for Tika 2 that we haven't yet "broken". (There's also removing some deprecated stuff etc) As I

RE: Not-yet-broken breaking changes for Tika 2?

2017-10-26 Thread Allison, Timothy B.
At this point, I'm willing to punt to 3.x, unless there's momentum for either of these two. They would be great to have! 1) chaining multiple parsers -- additive This shouldn't be too bad, except where there's conflicting metadata -- parser1 says author is 'bob', parser2 says author is

Re: Not-yet-broken breaking changes for Tika 2?

2017-10-26 Thread Chris Mattmann
On collision, the precedence order defines what key takes precedence and _overwrites_ the other. Overwrite is but one option (you could save *all* the values; it’s a multi-valued key structure, so…) Cheers, Chris On 10/26/17, 9:43 AM, "Nick Burch" wrote: On Thu, 26
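The two collision options mentioned here (overwrite by precedence, or keep all values) could be sketched in plain Java as follows. This is only an illustration under stated assumptions: it models metadata as a Map of multi-valued keys rather than using Tika's Metadata class, and the names MetadataPrecedence, OnCollision and merge, as well as the sample values, are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of Tika or OODT): merges metadata by precedence order. */
final class MetadataPrecedence {

    /** What to do when two sources supply the same key. */
    enum OnCollision { OVERWRITE, KEEP_ALL }

    /**
     * Merges the given metadata maps.
     *
     * @param sources metadata maps ordered from lowest to highest precedence
     * @param policy  OVERWRITE lets the higher-precedence source win;
     *                KEEP_ALL appends every value in precedence order
     */
    static Map<String, List<String>> merge(List<Map<String, List<String>>> sources,
                                           OnCollision policy) {
        Map<String, List<String>> merged = new LinkedHashMap<>();
        for (Map<String, List<String>> source : sources) {
            for (Map.Entry<String, List<String>> entry : source.entrySet()) {
                if (policy == OnCollision.OVERWRITE) {
                    // Later (higher-precedence) sources replace earlier values.
                    merged.put(entry.getKey(), new ArrayList<>(entry.getValue()));
                } else {
                    // Multi-valued key: accumulate values from every source.
                    merged.computeIfAbsent(entry.getKey(), k -> new ArrayList<>())
                          .addAll(entry.getValue());
                }
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Placeholder values standing in for two parsers that disagree on one key.
        Map<String, List<String>> parser1 =
                Collections.singletonMap("author", Arrays.asList("value-from-parser1"));
        Map<String, List<String>> parser2 =
                Collections.singletonMap("author", Arrays.asList("value-from-parser2"));
        List<Map<String, List<String>>> sources = Arrays.asList(parser1, parser2);

        System.out.println(merge(sources, OnCollision.OVERWRITE));
        System.out.println(merge(sources, OnCollision.KEEP_ALL));
    }
}
```

With OVERWRITE, the last source in the list wins for a colliding key; with KEEP_ALL, the key ends up holding both values in precedence order. Which ordering counts as "higher precedence" is a design choice the thread leaves open.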

Re: Not-yet-broken breaking changes for Tika 2?

2017-10-26 Thread Chris Mattmann
Why don’t we just store N copies of the stream, and parse it twice? Of course that’s the ugly way, but currently the way I’ve hacked this in all of my projects is simply to call Tika N times OUTSIDE of Tika. Why don’t we just use that as the weakest baseline and work backwards from there? Chris