Re: Plans for the first Tika 2.0 release
NLP/NER is as high a priority to me as the OCR stuff. We have a whole meta framework for doing NER/NLP with NERecogniser and really cool TensorFlow and other stuff. Hoping 2.0 can help solve this! ☺

++
Chris Mattmann, Ph.D.
Chief Architect, Instrument Software and Science Data Systems Section (398)
Manager, Open Source Projects Formulation and Development Office (8212)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++

On 9/21/16, 7:40 AM, "Nick Burch" wrote:

> On Mon, 19 Sep 2016, Bob Paulin wrote:
> > I think it's a good thing to discuss. I know there are other features
> > that are targeted for 2.0. Do we have a general sense of where those
> > features are at?
>
> I think the big one we need to crack is allowing multiple parsers to run
> against a file. OCR is probably the most critical of these from the
> modularisation perspective, with all those nasty interlinkings between the
> parsers to allow the manual delegation. If we can crack the problem of
> multiple parsers, those proxy issues should go away (or at least get
> better!)
>
> As a bonus, it ought to also improve things for error cases (fallback
> parsers etc), but for your needs, the simplification for "ocr + image
> metadata" is likely your biggest win!
>
> (I think it might also let us tidy up some of the enhancement parsers too,
> like how the NLP stuff fits into the parsing framework)
>
> Nick
Re: Plans for the first Tika 2.0 release
On Mon, 19 Sep 2016, Bob Paulin wrote:
> I think it's a good thing to discuss. I know there are other features
> that are targeted for 2.0. Do we have a general sense of where those
> features are at?

I think the big one we need to crack is allowing multiple parsers to run against a file. OCR is probably the most critical of these from the modularisation perspective, with all those nasty interlinkings between the parsers to allow the manual delegation. If we can crack the problem of multiple parsers, those proxy issues should go away (or at least get better!)

As a bonus, it ought to also improve things for error cases (fallback parsers etc), but for your needs, the simplification for "ocr + image metadata" is likely your biggest win!

(I think it might also let us tidy up some of the enhancement parsers too, like how the NLP stuff fits into the parsing framework)

Nick
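
[Editor's sketch] The "ocr + image metadata" case above amounts to running several parsers over the same bytes and merging what each one reports. The Java below is only a rough illustration of that idea under stated assumptions: `SketchExtractor` and `SupplementaryChain` are hypothetical names invented for this note, not Tika classes, and the key-clash rule (first extractor wins) is an arbitrary choice rather than anything decided in this thread.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a parser that only reports metadata; NOT a Tika interface. */
interface SketchExtractor {
    Map<String, String> extract(byte[] input);
}

/** Hypothetical strategy: run every extractor on the same bytes and merge the results. */
final class SupplementaryChain {
    private final List<SketchExtractor> extractors;

    SupplementaryChain(List<SketchExtractor> extractors) {
        this.extractors = extractors;
    }

    Map<String, String> extractAll(byte[] input) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (SketchExtractor extractor : extractors) {
            // Each extractor gets its own copy, so none can disturb the input for the next.
            Map<String, String> found = extractor.extract(input.clone());
            for (Map.Entry<String, String> entry : found.entrySet()) {
                // Assumption: earlier extractors win on key clashes.
                merged.putIfAbsent(entry.getKey(), entry.getValue());
            }
        }
        return merged;
    }
}

class SupplementaryChainDemo {
    public static void main(String[] args) {
        // Two toy extractors standing in for "image metadata" and "OCR".
        SketchExtractor imageMetadata =
                in -> Collections.singletonMap("length", String.valueOf(in.length));
        SketchExtractor fakeOcr =
                in -> Collections.singletonMap("text", "placeholder OCR text");

        SupplementaryChain chain =
                new SupplementaryChain(Arrays.asList(imageMetadata, fakeOcr));
        System.out.println(chain.extractAll(new byte[] {1, 2, 3}));
    }
}
```

Holding the whole input in a byte array keeps the sketch short; a real design would more likely need a re-readable stream so large files are not copied once per parser.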
Re: Plans for the first Tika 2.0 release
I think that could work! I've also created a custom filter that might help:

https://issues.apache.org/jira/browse/TIKA-2083?filter=12338448

Logic is as follows:

project = TIKA AND affectedVersion = 2.0 AND priority >= Blocker AND status != Closed AND status != Fixed

- Bob

On 9/19/2016 1:40 PM, Allison, Timothy B. wrote:
>> Should we create a tika-2_0-blocker label to differentiate from regular
>> "blockers"?
>
> How about a single master issue: TIKA-2085. What else do we need to add?
RE: Plans for the first Tika 2.0 release
> Should we create a tika-2_0-blocker label to differentiate from regular
> "blockers"?

How about a single master issue: TIKA-2085. What else do we need to add?
RE: Plans for the first Tika 2.0 release
>> 1) Implement various strategies for chaining multiple parsers against
>> individual files. Much of this has been implemented, but what's holding us
>> up on this one (I think?) is a resettable outputstream.
> I think we need a JIRA for this. Is there any existing design ideas on how
> this would be achieved?

Opened TIKA-2084 as subtask of TIKA-1509

>> 2) Rich metadata (TIKA-1607)
> This is great. I think we need to ensure we have JIRAs for all the features
> we consider blockers and label them as such. This looks like there's a lot
> of good discussion. It also references TIKA-1903 so is that also a Tika 2.0
> blocker?

TIKA-1903 is not a blocker on 2.0, and may be obviated by TIKA-1607.

>> 1) Get rid of old metadata tags in favor of "new" Dublin core
> Need JIRA?

Sorry, opened a good while ago: TIKA-1974

> If we can't get a date we should at least try to eliminate the ???. I think
> we need to close down the feature set.

Y, completely agree. Should we create a tika-2_0-blocker label to differentiate from regular "blockers"?
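
[Editor's sketch] To make the "resettable outputstream" point concrete: with a fallback strategy, a parser that fails halfway may already have written partial output, and that output has to be thrown away before the next parser runs. The Java below is a minimal sketch of that idea, assuming the output can be buffered in memory. `SketchParser` and `FallbackChain` are hypothetical names made up for this note; they are not the Tika `Parser` API and not the design tracked in TIKA-2084.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a parser; NOT the Tika Parser interface. */
interface SketchParser {
    void parse(byte[] input, OutputStream out) throws IOException;
}

/** Hypothetical fallback strategy: try parsers in order, discarding partial output on failure. */
final class FallbackChain {
    private final List<SketchParser> parsers;

    FallbackChain(List<SketchParser> parsers) {
        this.parsers = parsers;
    }

    byte[] parse(byte[] input) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        IOException lastFailure = null;
        for (SketchParser parser : parsers) {
            // reset() drops anything a previously failed parser wrote into the buffer.
            buffer.reset();
            try {
                parser.parse(input, buffer);
                return buffer.toByteArray();
            } catch (IOException e) {
                lastFailure = e; // remember the failure, then fall through to the next parser
            }
        }
        throw new IOException("all parsers failed", lastFailure);
    }
}

class FallbackChainDemo {
    public static void main(String[] args) throws IOException {
        // Writes some output and then fails, to simulate a parser dying midway.
        SketchParser broken = (in, out) -> {
            out.write("partial ".getBytes(StandardCharsets.UTF_8));
            throw new IOException("simulated parse failure");
        };
        // Simply echoes the input.
        SketchParser working = (in, out) -> out.write(in);

        FallbackChain chain = new FallbackChain(Arrays.asList(broken, working));
        byte[] result = chain.parse("hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(result, StandardCharsets.UTF_8));
    }
}
```

Buffering everything in memory is the simplest way to get a reset, but it does not fit streaming, event-based output, which may be part of why this item is harder than it looks.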
Re: Plans for the first Tika 2.0 release
Thanks Tim! Replies in line.

- Bob

On 9/19/2016 12:33 PM, Allison, Timothy B. wrote:
> Bob,
> As always, thank you for driving 2.0!
>
>> My concern is we have been dual maintaining 2 branches for about 9 months.
>> I think the longer we do this the more risk there is that we miss something.
>
> Agreed. I think we're already missing a few things.

Yikes, is there a way we can audit what we might have missed? Perhaps we need a JIRA to do an audit of the commits in master and do a best effort of what might have been missed? I can create the JIRA for this.

>> Would it make sense to at least put a date out there for a feature cut off?
>
> I'd be hesitant to do this. To my mind, the key is the actual features and
> devs who have time to implement them.

OK, this is a start to understand what the blocking features are. The key will be creating concrete JIRAs for them and identifying where we are at.

> For me, the blocking new features are:
> 1) Implement various strategies for chaining multiple parsers against
> individual files. Much of this has been implemented, but what's holding us
> up on this one (I think?) is a resettable outputstream.

I think we need a JIRA for this. Are there any existing design ideas on how this would be achieved?

> 2) Rich metadata (TIKA-1607)

This is great. I think we need to ensure we have JIRAs for all the features we consider blockers and label them as such. This looks like there's a lot of good discussion. It also references TIKA-1903, so is that also a Tika 2.0 blocker?

> The blocking tasks:
> 1) Get rid of old metadata tags in favor of "new" Dublin core

Need JIRA?

> 2) ???

If we can't get a date we should at least try to eliminate the ???. I think we need to close down the feature set.

> I'm full up on other stuff at the moment, perhaps after we get 1.14 out, I
> can turn to 2.0-specific development.
>
> What else do we have to do? Anyone else have some time?

Yes please, it would be great to see if there are people that want to own work on the above features. Once we have JIRAs we can post to the Apache Help Wanted page as well. Thanks!

> Cheers,
> Tim
RE: Plans for the first Tika 2.0 release
Bob,
As always, thank you for driving 2.0!

> My concern is we have been dual maintaining 2 branches for about 9 months. I
> think the longer we do this the more risk there is that we miss something.

Agreed. I think we're already missing a few things.

> Would it make sense to at least put a date out there for a feature cut off?

I'd be hesitant to do this. To my mind, the key is the actual features and devs who have time to implement them.

For me, the blocking new features are:
1) Implement various strategies for chaining multiple parsers against individual files. Much of this has been implemented, but what's holding us up on this one (I think?) is a resettable outputstream.
2) Rich metadata (TIKA-1607)

The blocking tasks:
1) Get rid of old metadata tags in favor of "new" Dublin core
2) ???

I'm full up on other stuff at the moment, perhaps after we get 1.14 out, I can turn to 2.0-specific development.

What else do we have to do? Anyone else have some time?

Cheers,
Tim

-----Original Message-----
From: Bob Paulin [mailto:b...@bobpaulin.com]
Sent: Monday, September 19, 2016 10:32 AM
To: dev@tika.apache.org
Subject: Re: Plans for the first Tika 2.0 release

Hi,

I think it's a good thing to discuss. I know there are other features that are targeted for 2.0. Do we have a general sense of where those features are at? My concern is we have been dual maintaining 2 branches for about 9 months. I think the longer we do this the more risk there is that we miss something. Would it make sense to at least put a date out there for a feature cut off? There's always 3.0 if things are not close to being ready.

- Bob
Re: Plans for the first Tika 2.0 release
Hi,

I think it's a good thing to discuss. I know there are other features that are targeted for 2.0. Do we have a general sense of where those features are at? My concern is we have been dual maintaining 2 branches for about 9 months. I think the longer we do this the more risk there is that we miss something. Would it make sense to at least put a date out there for a feature cut off? There's always 3.0 if things are not close to being ready.

- Bob

On 9/19/2016 4:32 AM, Sergey Beryozkin wrote:
> Hi All
>
> Back in May I updated one of our CXF demos on the master 3.2 branch to
> depend on Tika 2.0 SNAPSHOT to verify the new module system works well.
>
> It is feasible that CXF 3.2.0 may be released by the end of the year or
> early next year. As far as Tika 2.0 dependencies are concerned it will be
> easy for me to update the demo to temporarily depend on Tika 1.13 or 1.14.
> But if Tika 2.0 is released by the time CXF 3.2 is about to be released
> then I'll be happy to keep 2.0 deps.
>
> Are there any plans to get Tika 2.0 out in the next few months ?
>
> Cheers, Sergey
Plans for the first Tika 2.0 release
Hi All

Back in May I updated one of our CXF demos on the master 3.2 branch to depend on Tika 2.0 SNAPSHOT to verify the new module system works well.

It is feasible that CXF 3.2.0 may be released by the end of the year or early next year. As far as Tika 2.0 dependencies are concerned it will be easy for me to update the demo to temporarily depend on Tika 1.13 or 1.14. But if Tika 2.0 is released by the time CXF 3.2 is about to be released then I'll be happy to keep 2.0 deps.

Are there any plans to get Tika 2.0 out in the next few months?

Cheers, Sergey