...@mitre.org> wrote:
Do we worry about properly closing tags on an exception?
kaboom
mailto:lfcnas...@gmail.com]
Sent: Monday, February 5, 2018 5:34 PM
To: dev@tika.apache.org
Subject: Re: Not-yet-broken breaking changes for Tika
exception?
> kaboom
> mailto:lfcnas...@gmail.com]
> Sent: Monday, February 5, 2018 5:34 PM
> To: dev@tika.apache.org
> Subject: Re: Not-yet-broken breaking changes for Tika 2?
>
> From a forensic use case it is better just saying we are
-----Original Message-----
From: Mattmann, Chris A (1761) [mailto:chris.a.mattm...@jpl.nasa.gov]
Sent: Monday, February 5, 2018 12:29 PM
To: dev@tika.apache.org
Subject: Re: Not-yet-broken breaking changes for Tika 2?
Our solution is just to run the parser 2x... yes I get it will i
Subject: Re: Not-yet-broken breaking changes for Tika 2?
Ping - anyone got any thoughts on the proposed metadata parser stuff, and any
ideas on the content part?
On Tue, 2 Jan 2018, Nick Burch wrote:
> On Thu, 26 Oct 2017, Chris Mattmann wrote:
>> On collision, the precedence order defines what key takes p
type that has a reset() method?
Or do we just say, hey, now we're trying a different parser...
-----Original Message-----
From: Mattmann, Chris A (1761) [mailto:chris.a.mattm...@jpl.nasa.gov]
Sent: Monday, February 5, 2018 12:29 PM
To: dev@tika.apache.org
Subject: Re: Not-yet-broken breaking
Spool to temp file?
-----Original Message-----
From: Mattmann, Chris A (1761) [mailto:chris.a.mattm...@jpl.nasa.gov]
Sent: Monday, February 5, 2018 12:29 PM
To: dev@tika.apache.org
Subject: Re: Not-yet-broken breaking changes for Tika 2?
Our solution is just to run the parser 2x... yes I get
Our solution is just to run the parser 2x... yes, I get it will induce overhead,
but as a start, why not?
In short just run through the stream 2x
++
Chris Mattmann, Ph.D.
Associate Chief Technology and Innovation Officer,
On Mon, 5 Feb 2018, Chris Mattmann wrote:
Let's have a go at implementing it! You know my thoughts (make it like
OODT ;) )
I'm still keen to hear how we can do the text content like OODT!
I have tried to copy the OODT model for the proposed metadata case though :)
Nick
On 2/5/18, 8:37
Let's have a go at implementing it! You know my thoughts (make it like OODT ;) )
On 2/5/18, 8:37 AM, "Nick Burch" wrote:
Ping - anyone got any thoughts on the proposed metadata parser stuff, and
any ideas on the content part?
On Tue, 2 Jan 2018, Nick
Ping - anyone got any thoughts on the proposed metadata parser stuff, and
any ideas on the content part?
On Tue, 2 Jan 2018, Nick Burch wrote:
On Thu, 26 Oct 2017, Chris Mattmann wrote:
On collision, the precedence order defines what key takes precedence and
_overwrites_ the other. Overwrite
Sorry to ignore this for so long...
On Thu, 26 Oct 2017, Chris Mattmann wrote:
On collision, the precedence order defines what key takes precedence and
_overwrites_ the other. Overwrite is but one option (you could save
*all* the values; it's a multi-valued key structure, so…)
OK, I think
On collision, the precedence order defines what key takes precedence and
_overwrites_ the other. Overwrite is but one option (you could save *all*
the values; it's a multi-valued key structure, so…)
Cheers,
Chris
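The two collision policies described above (overwrite by precedence, or keep every value in a multi-valued structure) can be sketched in plain Java. This is only an illustration under stated assumptions: `PrecedenceMerge`, `overwrite`, and `keepAll` are hypothetical names, the maps stand in for per-parser metadata, the sample values are made up, and none of this is Tika's or OODT's actual API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of Tika or OODT.
final class PrecedenceMerge {

    // Overwrite policy: sources are listed lowest precedence first,
    // so a later source replaces the values of an earlier one on collision.
    static Map<String, List<String>> overwrite(List<Map<String, List<String>>> sources) {
        Map<String, List<String>> merged = new LinkedHashMap<>();
        for (Map<String, List<String>> source : sources) {
            for (Map.Entry<String, List<String>> e : source.entrySet()) {
                merged.put(e.getKey(), new ArrayList<>(e.getValue()));
            }
        }
        return merged;
    }

    // Keep-all policy: values accumulate in source order; exact repeats are dropped.
    static Map<String, List<String>> keepAll(List<Map<String, List<String>>> sources) {
        Map<String, List<String>> merged = new LinkedHashMap<>();
        for (Map<String, List<String>> source : sources) {
            for (Map.Entry<String, List<String>> e : source.entrySet()) {
                List<String> values = merged.computeIfAbsent(e.getKey(), k -> new ArrayList<>());
                for (String v : e.getValue()) {
                    if (!values.contains(v)) {
                        values.add(v);
                    }
                }
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Made-up sample metadata from two imaginary parsers.
        Map<String, List<String>> parser1 = Map.of("author", List.of("bob"));
        Map<String, List<String>> parser2 = Map.of("author", List.of("alice"));
        System.out.println(overwrite(List.of(parser1, parser2))); // {author=[alice]}
        System.out.println(keepAll(List.of(parser1, parser2)));   // {author=[bob, alice]}
    }
}
```

Listing sources lowest precedence first keeps the overwrite case a simple "last writer wins" loop; reversing the list flips the precedence without touching the merge code.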
On 10/26/17, 9:43 AM, "Nick Burch" wrote:
On Thu, 26
On Thu, 26 Oct 2017, Chris Mattmann wrote:
My general approach to conflicting metadata is simply to define
precedence orders.
For example here is one documented from OODT:
https://cwiki.apache.org/confluence/display/OODT/Understanding+CAS-PGE+Metadata+Precendence
We can do similar things
Thanks Nick.
My general approach to conflicting metadata is simply to define precedence
orders.
For example here is one documented from OODT:
https://cwiki.apache.org/confluence/display/OODT/Understanding+CAS-PGE+Metadata+Precendence
We can do similar things with Tika, e.g.,
On Thu, 26 Oct 2017, Chris Mattmann wrote:
Why don’t we just store N copies of the stream, and parse it twice?
I'm not sure that's the challenge though? Using TikaInputStream we can
buffer to a temp file if needed to re-read the input
Of course that’s the ugly way, but currently the way
Why don’t we just store N copies of the stream, and parse it twice?
Of course that’s the ugly way, but currently the way I’ve hacked this in all of
my projects is simply to call Tika N times OUTSIDE of Tika. Why don’t we just
use that as the weakest baseline and work backwards from there?
Chris
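The "parse it N times" baseline, combined with the temp-file buffering mentioned earlier in the thread, might look roughly like the sketch below. It deliberately uses only `java.nio` rather than TikaInputStream, and `TwoPassSpool`, `StreamConsumer`, and `runAll` are hypothetical names standing in for real parsers; treat it as an assumption-laden illustration, not how Tika does or will do it.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: spool a one-shot stream to disk, then read it once per consumer.
final class TwoPassSpool {

    // Stand-in for "a parser"; this is not Tika's Parser interface.
    interface StreamConsumer {
        String consume(InputStream in) throws IOException;
    }

    static List<String> runAll(InputStream input, List<StreamConsumer> consumers) throws IOException {
        Path spool = Files.createTempFile("spool-", ".bin");
        try {
            // createTempFile already created the file, so replace it with the stream contents.
            Files.copy(input, spool, StandardCopyOption.REPLACE_EXISTING);
            List<String> results = new ArrayList<>();
            for (StreamConsumer consumer : consumers) {
                // Each consumer gets a fresh stream positioned at the start of the data.
                try (InputStream in = Files.newInputStream(spool)) {
                    results.add(consumer.consume(in));
                }
            }
            return results;
        } finally {
            Files.deleteIfExists(spool);
        }
    }

    public static void main(String[] args) throws IOException {
        StreamConsumer countBytes = in -> String.valueOf(in.readAllBytes().length);
        StreamConsumer asText = in -> new String(in.readAllBytes(), StandardCharsets.UTF_8);
        InputStream input = new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(runAll(input, List.of(countBytes, asText)));
    }
}
```

The cost is one extra copy to disk plus one full read per consumer, which is the overhead the thread already concedes for this baseline.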
At this point, I'm willing to punt to 3.x, unless there's momentum for either
of these two. They would be great to have!
1) chaining multiple parsers -- additive
This shouldn't be too bad, except where there's conflicting metadata -- parser1
says author is 'bob', parser2 says author is