RE: Tika 1.14?
Woohoo! Thank you! -Original Message- From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov] Sent: Thursday, September 29, 2016 1:27 PM To: dev@tika.apache.org Subject: Re: Tika 1.14? If there aren’t any objections I’ll roll 1.14 this weekend with an RC1 by Monday. ++ Chris Mattmann, Ph.D. Chief Architect, Instrument Software and Science Data Systems Section (398) Manager, Open Source Projects Formulation and Development Office (8212) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Director, Information Retrieval and Data Science Group (IRDS) Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA WWW: http://irds.usc.edu/ ++
Re: Tika 1.14?
If there aren’t any objections I’ll roll 1.14 this weekend with an RC1 by Monday. ++ Chris Mattmann, Ph.D. Chief Architect, Instrument Software and Science Data Systems Section (398) Manager, Open Source Projects Formulation and Development Office (8212) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Director, Information Retrieval and Data Science Group (IRDS) Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA WWW: http://irds.usc.edu/ ++ On 9/29/16, 8:07 AM, "Allison, Timothy B." wrote: I didn't find any showstoppers. Are we ready for Chris to roll 1.14-rc1? Some notes: We're getting quite a few new attachments: 315k (mostly from newly recognized mbox, and MSOffice) New mimetypes: mbox, text/calendar, x-sh, vnd.djvu, dbf, and many more The upgraded copy of icu4j is misidentifying a handful of files as UTF-16[LB]E. We're missing a small amount of text from custom PPT templates (known issue) We're getting quite a few new exceptions for attachments that weren't formerly extracted. These are unknown embedded objects that are being misidentified as PSD, other image files or TTF. We're getting quite a few new exceptions for files that are now correctly identified as "x-ms-asx" because they contain invalid xml
RE: Tika 1.14?
I didn't find any showstoppers. Are we ready for Chris to roll 1.14-rc1?

Some notes:
We're getting quite a few new attachments: 315k (mostly from newly recognized mbox, and MSOffice)
New mimetypes: mbox, text/calendar, x-sh, vnd.djvu, dbf, and many more
The upgraded copy of icu4j is misidentifying a handful of files as UTF-16[LB]E.
We're missing a small amount of text from custom PPT templates (known issue)
We're getting quite a few new exceptions for attachments that weren't formerly extracted. These are unknown embedded objects that are being misidentified as PSD, other image files or TTF.
We're getting quite a few new exceptions for files that are now correctly identified as "x-ms-asx" because they contain invalid xml

-Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Wednesday, September 28, 2016 1:34 PM To: dev@tika.apache.org Subject: RE: Tika 1.14? All, I finished running the regression tests. I have just started going through the results. Reports are available here: https://github.com/tballison/share/blob/master/tika_comparisons/reports_1_14-trunk_vs_1_13.zip
RE: Tika 1.14?
All, I finished running the regression tests. I have just started going through the results. Reports are available here: https://github.com/tballison/share/blob/master/tika_comparisons/reports_1_14-trunk_vs_1_13.zip -Original Message- From: Chris Mattmann [mailto:mattm...@apache.org] Sent: Thursday, September 22, 2016 12:25 PM To: dev@tika.apache.org Subject: Re: Tika 1.14? Sounds great to me Tim. If you tell me when the tests are done, I’d be happy to RC a release!
RE: Tika 1.14?
Thank you, Chris! -Original Message- From: Chris Mattmann [mailto:mattm...@apache.org] Sent: Thursday, September 22, 2016 12:25 PM To: dev@tika.apache.org Subject: Re: Tika 1.14? Sounds great to me Tim. If you tell me when the tests are done, I’d be happy to RC a release!
Re: Tika 1.14?
Sounds great to me Tim. If you tell me when the tests are done, I’d be happy to RC a release! On 9/21/16, 11:31 AM, "Allison, Timothy B." wrote: All, PDFBox 2.0.3 is now integrated, I'm about to push the integration with POI-3.15. I have a few cleanup things I'd like to take care of. Any other items for 1.14? Should we aim for Mon 26th for final code changes for 1.14? I can run the regression tests, and then maybe we could cut the release candidate some time mid to end of next week? Best, Tim
RE: Tika 1.14?
All, PDFBox 2.0.3 is now integrated, I'm about to push the integration with POI-3.15. I have a few cleanup things I'd like to take care of. Any other items for 1.14? Should we aim for Mon 26th for final code changes for 1.14? I can run the regression tests, and then maybe we could cut the release candidate some time mid to end of next week? Best, Tim
RE: Tika 1.14?
Let me touch back in a month. ;) Looks like PDFBox 2.0.3 and POI-3.15-beta3 or POI-3.15-final will be out shortly. Any blockers/wishes on 1.14? -Original Message- From: lewis john mcgibbney [mailto:lewi...@apache.org] Sent: Friday, August 12, 2016 7:51 PM To: dev@tika.apache.org Subject: Re: Tika 1.14? Good thread Tim, Regarding open issues and low hanging fruit to make it into 1.14, I will also work on finishing https://github.com/apache/tika/pull/112. I think Bob has an excellent point. The 2.X work is major and would be a big step in the right direction. Having both branches longer and longer is going to end up nasty given more time. I'll try to touch back here in a week or so and see where we are.
Re: Tika 1.14?
Good thread Tim, Regarding open issues and low hanging fruit to make it into 1.14, I will also work on finishing https://github.com/apache/tika/pull/112. I think Bob has an excellent point. The 2.X work is major and would be a big step in the right direction. Having both branches longer and longer is going to end up nasty given more time. I'll try to touch back here in a week or so and see where we are. On Fri, Aug 12, 2016 at 4:24 AM, wrote: > > From: "Allison, Timothy B." > To: "dev@tika.apache.org" > Cc: > Date: Thu, 11 Aug 2016 18:59:56 + > Subject: Tika 1.14? > All, > Any interest in a Tika 1.14 release in a few weeks, say first week of > September? I'd like to test and integrate POI 3.15-beta3 which should be > out fairly soon. Any other blockers or wishes? > > -- http://home.apache.org/~lewismc/ @hectorMcSpector http://www.linkedin.com/in/lmcgibbney
Re: Tika 1.14?
1508, and 1680 are pending me/my review. I’ll get it done today. On 8/12/16, 4:24 AM, "Allison, Timothy B." wrote: >> I know it's been a little bit since we talked about 2.0. We had discussed holding off while some API changes that were under consideration. Has any progress been made on this? > I think we're still trying to come up with a plan for how to allow multiple parsers to report text for one And > I believe we've also still got the issue of structured metadata outstanding. Y, I agree on both for 2.0. Anything else that we need to get into 1.14? Sounds like there's a chance we'll be able to get PDFBox 2.0.3 into 1.14... Thamme, any interest in wrapping up TIKA-1508/TIKA-1680/TIKA-1986 in the next few weeks? Cheers, Tim
RE: Tika 1.14?
I think waiting for pdfbox 2.0.3 would be great. There are some regressions fixed. Regards, Luis On Aug 12, 2016 08:24, "Allison, Timothy B." wrote: > >> I know it's been a little bit since we talked about 2.0. We had > discussed holding off while some API changes that were under > consideration. Has any progress been made on this? > > > I think we're still trying to come up with a plan for how to allow > multiple parsers to report text for one > > And > > I believe we've also still got the issue of structured metadata > outstanding. > > Y, I agree on both for 2.0. Anything else that we need to get into 1.14? > Sounds like there's a chance we'll be able to get PDFBox 2.0.3 into 1.14... > > Thamme, any interest in wrapping up TIKA-1508/TIKA-1680/TIKA-1986 in the > next few weeks? > > Cheers, > > Tim > >
RE: Tika 1.14?
>> I know it's been a little bit since we talked about 2.0. We had discussed >> holding off while some API changes that were under consideration. Has any >> progress been made on this? > I think we're still trying to come up with a plan for how to allow multiple > parsers to report text for one And > I believe we've also still got the issue of structured metadata outstanding. Y, I agree on both for 2.0. Anything else that we need to get into 1.14? Sounds like there's a chance we'll be able to get PDFBox 2.0.3 into 1.14... Thamme, any interest in wrapping up TIKA-1508/TIKA-1680/TIKA-1986 in the next few weeks? Cheers, Tim
Re: Tika 1.14?
I believe we've also still got the issue of structured metadata outstanding. Regards, Ray > On Aug 12, 2016, at 6:27 AM, Nick Burch wrote: > > On Thu, 11 Aug 2016, Bob Paulin wrote: >> I know it's been a little bit since we talked about 2.0. We had discussed >> holding off while some API changes that were under consideration. Has any >> progress been made on this? > > I think we're still trying to come up with a plan for how to allow multiple > parsers to report text for one document (either for main parser + fallback > parser after error, or for two different kinds of parsers). That's then > blocking some of the changes around fallback parsers, multiple parsers etc. > Probably a few other API breaks / changes that'll fall out of that too > > How long we want to wait for a solution for that is a different question... > (I'm not that keen on saying "if the content handler is a TikaContentHandler > with some extra methods, great, otherwise throw an exception for >1 parser", > which is about the only one we've come up with so far) > > Nick
Re: Tika 1.14?
On Thu, 11 Aug 2016, Bob Paulin wrote:
> I know it's been a little bit since we talked about 2.0. We had discussed holding off while some API changes that were under consideration. Has any progress been made on this?

I think we're still trying to come up with a plan for how to allow multiple parsers to report text for one document (either for main parser + fallback parser after error, or for two different kinds of parsers). That's then blocking some of the changes around fallback parsers, multiple parsers etc. Probably a few other API breaks / changes that'll fall out of that too

How long we want to wait for a solution for that is a different question... (I'm not that keen on saying "if the content handler is a TikaContentHandler with some extra methods, great, otherwise throw an exception for >1 parser", which is about the only one we've come up with so far)

Nick
Re: Tika 1.14?
I know it's been a little bit since we talked about 2.0. We had discussed holding off while some API changes that were under consideration. Has any progress been made on this? The community has been really good about dual maintaining but how much longer do we want to have this expectation? Can we consider cutting 2.0 without some of these changes and do them in 3.0 or is that a non-starter? Other thoughts? - Bob Paulin On 8/11/2016 2:29 PM, Mattmann, Chris A (3980) wrote: Sounds good to me ++ Chris Mattmann, Ph.D. Chief Architect, Instrument Software and Science Data Systems Section (398) Manager, Open Source Projects Formulation and Development Office (8212) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Director, Information Retrieval and Data Science Group (IRDS) Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA WWW: http://irds.usc.edu/ ++
Re: Tika 1.14?
Sounds good to me ++ Chris Mattmann, Ph.D. Chief Architect, Instrument Software and Science Data Systems Section (398) Manager, Open Source Projects Formulation and Development Office (8212) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Director, Information Retrieval and Data Science Group (IRDS) Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA WWW: http://irds.usc.edu/ ++ On 8/11/16, 11:59 AM, "Allison, Timothy B." wrote: All, Any interest in a Tika 1.14 release in a few weeks, say first week of September? I'd like to test and integrate POI 3.15-beta3 which should be out fairly soon. Any other blockers or wishes? Cheers, Tim