Re: Plans for the first Tika 2.0 release

2016-09-21 Thread Mattmann, Chris A (3980)
NLP/NER is as high a priority to me as the OCR stuff. We have a whole meta
framework for doing NER/NLP with NERRecogniser and really cool TensorFlow and
other stuff. Hoping 2.0 can help solve this! ☺

++
Chris Mattmann, Ph.D.
Chief Architect, Instrument Software and Science Data Systems Section (398)
Manager, Open Source Projects Formulation and Development Office (8212)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++
 


On 9/21/16, 7:40 AM, "Nick Burch"  wrote:

On Mon, 19 Sep 2016, Bob Paulin wrote:
> I think it's a good thing to discuss.  I know there are other features 
> that are targeted for 2.0.  Do we have a general sense of where those 
> features are at?

I think the big one we need to crack is allowing multiple parsers to run 
against a file. OCR is probably the most critical of these from the 
modularisation perspective, with all those nasty interlinkings between the 
parsers to allow the manual delegation. If we can crack the problem of 
multiple parsers, those proxy issues should go away (or at least get 
better!)

As a bonus, it ought to also improve things for error cases (fallback 
parsers etc), but for your needs, the simplification for "ocr + image 
metadata" is likely your biggest win!

(I think it might also let us tidy up some of the enhancement parsers too, 
like how the NLP stuff fits into the parsing framework)

Nick





Re: Plans for the first Tika 2.0 release

2016-09-21 Thread Nick Burch

On Mon, 19 Sep 2016, Bob Paulin wrote:
> I think it's a good thing to discuss.  I know there are other features
> that are targeted for 2.0.  Do we have a general sense of where those
> features are at?


I think the big one we need to crack is allowing multiple parsers to run 
against a file. OCR is probably the most critical of these from the 
modularisation perspective, with all those nasty interlinkings between the 
parsers to allow the manual delegation. If we can crack the problem of 
multiple parsers, those proxy issues should go away (or at least get 
better!)


As a bonus, it ought to also improve things for error cases (fallback 
parsers etc), but for your needs, the simplification for "ocr + image 
metadata" is likely your biggest win!


(I think it might also let us tidy up some of the enhancement parsers too, 
like how the NLP stuff fits into the parsing framework)


Nick
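
A minimal sketch of the two strategies discussed above (fall back to another
parser on error, and run several parsers against the same file), not Tika's
actual design. SketchParser and ParserChain are hypothetical names invented
for illustration; neither is a Tika class. The sketch assumes the document is
handed to each parser as a byte array, so every parser sees the same input
without needing to reset a stream.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a parser; this is NOT Tika's Parser interface. */
interface SketchParser {
    /** Returns extracted text, or throws a RuntimeException if the bytes cannot be handled. */
    String parse(byte[] document);
}

/** Hypothetical helper that applies several parsers to the same document bytes. */
final class ParserChain {
    private final List<SketchParser> parsers;

    ParserChain(List<SketchParser> parsers) {
        this.parsers = new ArrayList<>(parsers);
    }

    /** Fallback strategy: return the output of the first parser that succeeds. */
    String parseFirst(byte[] document) {
        RuntimeException last = null;
        for (SketchParser parser : parsers) {
            try {
                return parser.parse(document);
            } catch (RuntimeException e) {
                last = e; // remember the failure and try the next parser
            }
        }
        throw new IllegalStateException("no parser succeeded", last);
    }

    /** Supplementary strategy: run every parser and keep each successful output. */
    List<String> parseAll(byte[] document) {
        List<String> outputs = new ArrayList<>();
        for (SketchParser parser : parsers) {
            try {
                outputs.add(parser.parse(document));
            } catch (RuntimeException e) {
                // a failing parser is skipped; the others still contribute
            }
        }
        return outputs;
    }

    public static void main(String[] args) {
        byte[] doc = "hello".getBytes(StandardCharsets.UTF_8);
        SketchParser broken = bytes -> { throw new IllegalStateException("unsupported format"); };
        SketchParser text = bytes -> new String(bytes, StandardCharsets.UTF_8);
        SketchParser size = bytes -> "length=" + bytes.length;

        ParserChain chain = new ParserChain(List.of(broken, text, size));
        System.out.println(chain.parseFirst(doc)); // prints "hello"
        System.out.println(chain.parseAll(doc));   // prints "[hello, length=5]"
    }
}
```

The "ocr + image metadata" case would correspond to parseAll here, and the
fallback case to parseFirst; how the outputs get merged is left open.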


Re: Plans for the first Tika 2.0 release

2016-09-19 Thread Bob Paulin

I think that could work!  I've also created a custom filter that might help:

https://issues.apache.org/jira/browse/TIKA-2083?filter=12338448

Logic is as follows:

project = TIKA AND affectedVersion = 2.0 AND priority >= Blocker AND 
status != Closed AND status != Fixed



- Bob


On 9/19/2016 1:40 PM, Allison, Timothy B. wrote:

>> Should we create a tika-2_0-blocker label to differentiate from regular
>> "blockers"?
>
> How about a single master issue: TIKA-2085.
>
> What else do we need to add?




RE: Plans for the first Tika 2.0 release

2016-09-19 Thread Allison, Timothy B.
> Should we create a tika-2_0-blocker label to differentiate from regular 
> "blockers"?

How about a single master issue: TIKA-2085.

What else do we need to add?


RE: Plans for the first Tika 2.0 release

2016-09-19 Thread Allison, Timothy B.
>> 1) Implement various strategies for chaining multiple parsers against 
>> individual files.  Much of this has been implemented, but what's holding us 
>> up on this one (I think?) is a resettable outputstream.
>I think we need a JIRA for this.  Are there any existing design ideas on how
>this would be achieved?
Opened TIKA-2084 as subtask of TIKA-1509
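
One possible reading of the "resettable outputstream" above, as a minimal
sketch only: output is buffered so that whatever a failing parser wrote can be
discarded before the next parser in the chain runs. ResettableBuffer and its
mark()/rollback() methods are hypothetical names for illustration; they are
not a Tika API and not necessarily what TIKA-2084 proposes.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch, not a Tika class: buffers output so partial writes can be rolled back. */
final class ResettableBuffer extends ByteArrayOutputStream {
    private int markedCount = 0;

    /** Remembers the current length as the rollback point. */
    synchronized void mark() {
        markedCount = count;
    }

    /** Discards everything written since the last mark(). */
    synchronized void rollback() {
        count = markedCount;
    }

    public static void main(String[] args) {
        ResettableBuffer out = new ResettableBuffer();

        byte[] good = "title;".getBytes(StandardCharsets.UTF_8);
        out.write(good, 0, good.length);
        out.mark(); // output so far is safe to keep

        byte[] partial = "garbag".getBytes(StandardCharsets.UTF_8);
        out.write(partial, 0, partial.length); // first parser fails midway
        out.rollback();                        // drop its partial output

        byte[] retry = "body".getBytes(StandardCharsets.UTF_8);
        out.write(retry, 0, retry.length);     // fallback parser writes instead

        System.out.println(new String(out.toByteArray(), StandardCharsets.UTF_8)); // prints "title;body"
    }
}
```

Buffering everything in memory is the simplest choice for a sketch; large
documents would presumably need spooling to disk or a bounded buffer.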

> 2) Rich metadata (TIKA-1607)
This is great.  I think we need to ensure we have JIRAs for all the features we 
consider blockers and label them as such.  This looks like there's a lot of 
good discussion.  It also references TIKA-1903 so is that also a Tika 2.0 
blocker?
TIKA-1903 is not a blocker on 2.0, and may be obviated by TIKA-1607.

>> 1) Get rid of old metadata tags in favor of "new" Dublin core
>Need JIRA?
Sorry, opened a good while ago: TIKA-1974

> If we can't get a date we should at least try to eliminate the ???. I think 
> we need to close down the feature set.
Y, completely agree.

Should we create a tika-2_0-blocker label to differentiate from regular 
"blockers"?


Re: Plans for the first Tika 2.0 release

2016-09-19 Thread Bob Paulin

Thanks Tim!  Replies in line.

- Bob
On 9/19/2016 12:33 PM, Allison, Timothy B. wrote:

> Bob,
>    As always, thank you for driving 2.0!
>
>> My concern is we have been dual maintaining 2 branches for about 9 months.  I
>> think the longer we do this the more risk there is that we miss something.
>
> Agreed.  I think we're already missing a few things.

Yikes, is there a way we can audit what we might have missed? Perhaps we
need a JIRA to do an audit of the commits in master and do a best effort
of what might have been missed?  I can create the JIRA for this.

>> Would it make sense to at least put a date out there for a feature cut off?
>
> I'd be hesitant to do this.  To my mind, the key is the actual features and
> devs who have time to implement them.

OK, this is a start to understand what the blocking features are. The key
will be creating concrete JIRAs for them and identifying where we are at.

> For me, the blocking new features are:
>
> 1) Implement various strategies for chaining multiple parsers against
> individual files.  Much of this has been implemented, but what's holding us up
> on this one (I think?) is a resettable outputstream.

I think we need a JIRA for this.  Are there any existing design ideas on
how this would be achieved?

> 2) Rich metadata (TIKA-1607)

This is great.  I think we need to ensure we have JIRAs for all the
features we consider blockers and label them as such.  This looks like
there's a lot of good discussion.  It also references TIKA-1903 so is
that also a Tika 2.0 blocker?

> The blocking tasks:
> 1) Get rid of old metadata tags in favor of "new" Dublin core

Need JIRA?

> 2) ???

If we can't get a date we should at least try to eliminate the ???. I
think we need to close down the feature set.

> I'm full up on other stuff at the moment, perhaps after we get 1.14 out, I can
> turn to 2.0-specific development.
>
> What else do we have to do? Anyone else have some time?

Yes please, it would be great to see if there are people that want to own
work on the above features.  Once we have JIRAs we can post to the
Apache Help Wanted page as well.

Thanks!

> Cheers,
>
> Tim
>
> -----Original Message-----
> From: Bob Paulin [mailto:b...@bobpaulin.com]
> Sent: Monday, September 19, 2016 10:32 AM
> To: dev@tika.apache.org
> Subject: Re: Plans for the first Tika 2.0 release
>
> Hi,
>
> I think it's a good thing to discuss.  I know there are other features that are
> targeted for 2.0.  Do we have a general sense of where those features are at?
> My concern is we have been dual maintaining 2 branches for about 9 months.  I
> think the longer we do this the more risk there is that we miss something.
> Would it make sense to at least put a date
> out there for a feature cut off?   There's always 3.0 if things are not
> close to being ready.
>
> - Bob






RE: Plans for the first Tika 2.0 release

2016-09-19 Thread Allison, Timothy B.
Bob,
  As always, thank you for driving 2.0!

> My concern is we have been dual maintaining 2 branches for about 9 months.  I 
> think the longer we do this the more risk there is that we miss something.  

Agreed.  I think we're already missing a few things.

> Would it make sense to at least put a date out there for a feature cut off?

I'd be hesitant to do this.  To my mind, the key is the actual features and 
devs who have time to implement them.

For me, the blocking new features are:

1) Implement various strategies for chaining multiple parsers against 
individual files.  Much of this has been implemented, but what's holding us up 
on this one (I think?) is a resettable outputstream.

2) Rich metadata (TIKA-1607)

The blocking tasks:
1) Get rid of old metadata tags in favor of "new" Dublin core
2) ???

I'm full up on other stuff at the moment, perhaps after we get 1.14 out, I can 
turn to 2.0-specific development.

What else do we have to do? Anyone else have some time?

Cheers,

   Tim

-----Original Message-----
From: Bob Paulin [mailto:b...@bobpaulin.com] 
Sent: Monday, September 19, 2016 10:32 AM
To: dev@tika.apache.org
Subject: Re: Plans for the first Tika 2.0 release

Hi,

I think it's a good thing to discuss.  I know there are other features that are 
targeted for 2.0.  Do we have a general sense of where those features are at?  
My concern is we have been dual maintaining 2 branches for about 9 months.  I 
think the longer we do this the more risk there is that we miss something.  
Would it make sense to at least put a date 
out there for a feature cut off?   There's always 3.0 if things are not 
close to being ready.


- Bob




Re: Plans for the first Tika 2.0 release

2016-09-19 Thread Bob Paulin

Hi,

I think it's a good thing to discuss.  I know there are other features 
that are targeted for 2.0.  Do we have a general sense of where those 
features are at?  My concern is we have been dual maintaining 2 branches 
for about 9 months.  I think the longer we do this the more risk there 
is that we miss something.  Would it make sense to at least put a date 
out there for a feature cut off?   There's always 3.0 if things are not 
close to being ready.



- Bob


On 9/19/2016 4:32 AM, Sergey Beryozkin wrote:

> Hi All
>
> Back in May I updated one of our CXF demos on the master 3.2 branch to
> depend on Tika 2.0 SNAPSHOT to verify the new module system works well.
> It is feasible that CXF 3.2.0 may be released by the end of the year
> or early next year.
> As far as Tika 2.0 dependencies are concerned it will be easy for me
> to update the demo to temporarily depend on Tika 1.13 or 1.14. But if
> Tika 2.0 is released by the time CXF 3.2 is about to be released then
> I'll be happy to keep 2.0 deps.
>
> Are there any plans to get Tika 2.0 out in the next few months?
>
> Cheers, Sergey








Plans for the first Tika 2.0 release

2016-09-19 Thread Sergey Beryozkin

Hi All

Back in May I updated one of our CXF demos on the master 3.2 branch to 
depend on Tika 2.0 SNAPSHOT to verify the new module system works well.
It is feasible that CXF 3.2.0 may be released by the end of the year or 
early next year.
As far as Tika 2.0 dependencies are concerned it will be easy for me to 
update the demo to temporarily depend on Tika 1.13 or 1.14. But if Tika 
2.0 is released by the time CXF 3.2 is about to be released then I'll be 
happy to keep 2.0 deps.

Are there any plans to get Tika 2.0 out in the next few months?

Cheers, Sergey