Re: Not-yet-broken breaking changes for Tika 2?

2018-02-07 Thread Chris Mattmann
...@mitre.org> wrote: Do we worry about properly closing tags on an exception? kaboom mailto:lfcnas...@gmail.com] Sent: Monday, February 5, 2018 5:34 PM To: dev@tika.apache.org Subject: Re: Not-yet-broken breaking changes for Tika

Re: Not-yet-broken breaking changes for Tika 2?

2018-02-07 Thread Luís Filipe Nassif
exception? > > > > > kaboom > mailto:lfcnas...@gmail.com] > Sent: Monday, February 5, 2018 5:34 PM > To: dev@tika.apache.org > Subject: Re: Not-yet-broken breaking changes for Tika 2? > > From a forensic use case it is better just saying we are

RE: Not-yet-broken breaking changes for Tika 2?

2018-02-07 Thread Allison, Timothy B.
Do we worry about properly closing tags on an exception? kaboom mailto:lfcnas...@gmail.com] Sent: Monday, February 5, 2018 5:34 PM To: dev@tika.apache.org Subject: Re: Not-yet-broken breaking changes for Tika 2? From a forensic use case it is better just

Re: Not-yet-broken breaking changes for Tika 2?

2018-02-05 Thread Luís Filipe Nassif
sage- From: Mattmann, Chris A (1761) [mailto:chris.a.mattm...@jpl.nasa.gov] Sent: Monday, February 5, 2018 12:29 PM To: dev@tika.apache.org Subject: Re: Not-yet-broken breaking changes for Tika 2? Our solution is just to run the parser 2x. Yes, I get it will i

Re: Not-yet-broken breaking changes for Tika 2?

2018-02-05 Thread Chris Mattmann
Original Message- From: Mattmann, Chris A (1761) [mailto:chris.a.mattm...@jpl.nasa.gov] Sent: Monday, February 5, 2018 12:29 PM To: dev@tika.apache.org Subject: Re: Not-yet-broken breaking changes for Tika 2? Our solution is just to run the parser 2x. Yes, I get it will

RE: Not-yet-broken breaking changes for Tika 2?

2018-02-05 Thread Allison, Timothy B.
-broken breaking changes for Tika 2? Ping - anyone got any thoughts on the proposed metadata parser stuff, and any ideas on the content part? On Tue, 2 Jan 2018, Nick Burch wrote: > On Thu, 26 Oct 2017, Chris Mattmann wrote: >> On collision, the precedence order defines what key takes p

RE: Not-yet-broken breaking changes for Tika 2?

2018-02-05 Thread Allison, Timothy B.
type that has a reset() method? Or do we just say, hey, now we're trying a different parser... -Original Message- From: Mattmann, Chris A (1761) [mailto:chris.a.mattm...@jpl.nasa.gov] Sent: Monday, February 5, 2018 12:29 PM To: dev@tika.apache.org Subject: Re: Not-yet-broken breaking

RE: Not-yet-broken breaking changes for Tika 2?

2018-02-05 Thread Allison, Timothy B.
Spool to temp file? -Original Message- From: Mattmann, Chris A (1761) [mailto:chris.a.mattm...@jpl.nasa.gov] Sent: Monday, February 5, 2018 12:29 PM To: dev@tika.apache.org Subject: Re: Not-yet-broken breaking changes for Tika 2? Our solution is just to run the parser 2x. Yes, I get

Re: Not-yet-broken breaking changes for Tika 2?

2018-02-05 Thread Mattmann, Chris A (1761)
Our solution is just to run the parser 2x. Yes, I get it will induce overhead, but as a start, why not? In short, just run through the stream 2x. ++ Chris Mattmann, Ph.D. Associate Chief Technology and Innovation Officer,

Re: Not-yet-broken breaking changes for Tika 2?

2018-02-05 Thread Nick Burch
On Mon, 5 Feb 2018, Chris Mattmann wrote: Let's have a go at implementing it! You know my thoughts (make it like OODT ;) ) I'm still keen to hear how we can do the text content like OODT! I have tried to copy the OODT model for the proposed metadata case though :) Nick On 2/5/18, 8:37

Re: Not-yet-broken breaking changes for Tika 2?

2018-02-05 Thread Chris Mattmann
Let's have a go at implementing it! You know my thoughts (make it like OODT ;) ) On 2/5/18, 8:37 AM, "Nick Burch" wrote: Ping - anyone got any thoughts on the proposed metadata parser stuff, and any ideas on the content part? On Tue, 2 Jan 2018, Nick

Re: Not-yet-broken breaking changes for Tika 2?

2018-02-05 Thread Nick Burch
Ping - anyone got any thoughts on the proposed metadata parser stuff, and any ideas on the content part? On Tue, 2 Jan 2018, Nick Burch wrote: On Thu, 26 Oct 2017, Chris Mattmann wrote: On collision, the precedence order defines what key takes precedence and _overwrites_ the other. Overwrite

Re: Not-yet-broken breaking changes for Tika 2?

2018-01-02 Thread Nick Burch
Sorry to ignore this for so long... On Thu, 26 Oct 2017, Chris Mattmann wrote: On collision, the precedence order defines what key takes precedence and _overwrites_ the other. Overwrite is but one option (you could save *all* the values; it’s a multi-valued key structure, so…) OK, I think

Re: Not-yet-broken breaking changes for Tika 2?

2017-10-26 Thread Chris Mattmann
On collision, the precedence order defines what key takes precedence and _overwrites_ the other. Overwrite is but one option (you could save *all* the values; it’s a multi-valued key structure, so…) Cheers, Chris On 10/26/17, 9:43 AM, "Nick Burch" wrote: On Thu, 26

Re: Not-yet-broken breaking changes for Tika 2?

2017-10-26 Thread Nick Burch
On Thu, 26 Oct 2017, Chris Mattmann wrote: My general approach to conflicting metadata is simply to define precedence orders. For example here is one documented from OODT: https://cwiki.apache.org/confluence/display/OODT/Understanding+CAS-PGE+Metadata+Precendence We can do similar things

Re: Not-yet-broken breaking changes for Tika 2?

2017-10-26 Thread Chris Mattmann
Thanks Nick. My general approach to conflicting metadata is simply to define precedence orders. For example here is one documented from OODT: https://cwiki.apache.org/confluence/display/OODT/Understanding+CAS-PGE+Metadata+Precendence We can do similar things with Tika, e.g.,

Re: Not-yet-broken breaking changes for Tika 2?

2017-10-26 Thread Nick Burch
On Thu, 26 Oct 2017, Chris Mattmann wrote: Why don’t we just store N copies of the stream, and parse it twice? I'm not sure that's the challenge though? Using TikaInputStream we can buffer to a temp file if needed to re-read the input Of course that’s the ugly way, but currently the way
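
The "spool to a temp file, then read the input more than once" idea raised in this thread can be sketched with the JDK alone. This is only an illustration under stated assumptions: `SpoolAndReparse`, `StreamParser`, and `parseWithEach` are hypothetical names invented here, not Tika classes or methods, and the sketch does not show how TikaInputStream itself buffers.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

public class SpoolAndReparse {

    /** Hypothetical stand-in for a parser: consumes a stream and returns some result. */
    public interface StreamParser {
        String parse(InputStream in) throws IOException;
    }

    /**
     * Copies the one-shot source stream to a temp file, then gives every
     * parser its own fresh stream over that file, so each one sees the
     * input from the first byte.
     */
    public static List<String> parseWithEach(InputStream source, List<StreamParser> parsers)
            throws IOException {
        Path spool = Files.createTempFile("spool-", ".bin");
        try {
            Files.copy(source, spool, StandardCopyOption.REPLACE_EXISTING);
            List<String> results = new ArrayList<>();
            for (StreamParser parser : parsers) {
                try (InputStream in = Files.newInputStream(spool)) {
                    results.add(parser.parse(in));
                }
            }
            return results;
        } finally {
            Files.deleteIfExists(spool); // always clean up the spooled copy
        }
    }

    public static void main(String[] args) throws IOException {
        // Two toy "parsers": one counts bytes, one upper-cases the text.
        StreamParser count = in -> String.valueOf(in.readAllBytes().length);
        StreamParser upper = in -> new String(in.readAllBytes(), StandardCharsets.UTF_8).toUpperCase();

        InputStream source = new ByteArrayInputStream("abc".getBytes(StandardCharsets.UTF_8));
        System.out.println(parseWithEach(source, List.of(count, upper))); // prints [3, ABC]
    }
}
```

The cost is one extra copy of the input on disk, which is the overhead the thread acknowledges; in exchange, no parser has to support resetting a partially consumed stream.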

Re: Not-yet-broken breaking changes for Tika 2?

2017-10-26 Thread Chris Mattmann
Why don’t we just store N copies of the stream, and parse it twice? Of course that’s the ugly way, but currently the way I’ve hacked this in all of my projects is simply to call Tika N times OUTSIDE of Tika. Why don’t we just use that as the weakest baseline and work backwards from there? Chris

RE: Not-yet-broken breaking changes for Tika 2?

2017-10-26 Thread Allison, Timothy B.
At this point, I'm willing to punt to 3.x, unless there's momentum for either of these two. They would be great to have! 1) chaining multiple parsers -- additive This shouldn't be too bad, except where there's conflicting metadata -- parser1 says author is 'bob', parser2 says author is
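
The two collision options discussed in this thread (the higher-precedence parser overwrites, or all values are kept in a multi-valued key) can be sketched as follows. This is a minimal illustration, not Tika's or OODT's implementation: `MetadataPrecedence`, `CollisionPolicy`, and `merge` are hypothetical names, plain JDK maps stand in for a metadata object, and the second author value `carol` is made up for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MetadataPrecedence {

    /** Hypothetical policy switch for what happens when two parsers set the same key. */
    public enum CollisionPolicy { OVERWRITE, KEEP_ALL }

    /**
     * Merges per-parser metadata maps. The input list is ordered from lowest
     * to highest precedence, so later maps win under OVERWRITE and append
     * their values under KEEP_ALL.
     */
    public static Map<String, List<String>> merge(
            List<Map<String, List<String>>> lowestToHighest, CollisionPolicy policy) {
        Map<String, List<String>> merged = new LinkedHashMap<>();
        for (Map<String, List<String>> metadata : lowestToHighest) {
            for (Map.Entry<String, List<String>> entry : metadata.entrySet()) {
                String key = entry.getKey();
                if (policy == CollisionPolicy.OVERWRITE || !merged.containsKey(key)) {
                    // First sighting of the key, or the newer value replaces the older one.
                    merged.put(key, new ArrayList<>(entry.getValue()));
                } else {
                    // Collision under KEEP_ALL: keep earlier values and append the new ones.
                    merged.get(key).addAll(entry.getValue());
                }
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, List<String>> parser1 = Map.of("author", List.of("bob"));
        Map<String, List<String>> parser2 = Map.of("author", List.of("carol")); // made-up value

        System.out.println(merge(List.of(parser1, parser2), CollisionPolicy.OVERWRITE)); // prints {author=[carol]}
        System.out.println(merge(List.of(parser1, parser2), CollisionPolicy.KEEP_ALL));  // prints {author=[bob, carol]}
    }
}
```

Keeping the precedence order as an explicit list makes the outcome of a collision a configuration decision rather than an accident of which parser happened to run last.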