RE: Tika 1.14?

2016-09-29 Thread Allison, Timothy B.
Woohoo!  Thank you!

-----Original Message-----
From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov] 
Sent: Thursday, September 29, 2016 1:27 PM
To: dev@tika.apache.org
Subject: Re: Tika 1.14?

If there aren’t any objections I’ll roll 1.14 this weekend with an RC1 by 
Monday.

++
Chris Mattmann, Ph.D.
Chief Architect, Instrument Software and Science Data Systems Section (398)
Manager, Open Source Projects Formulation and Development Office (8212)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++
 


Re: Tika 1.14?

2016-09-29 Thread Mattmann, Chris A (3980)
If there aren’t any objections I’ll roll 1.14 this weekend with an RC1 by 
Monday.

 

On 9/29/16, 8:07 AM, "Allison, Timothy B."  wrote:

I didn't find any showstoppers.  Are we ready for Chris to roll 1.14-rc1?


Some notes:
We're getting quite a few new attachments: 315k (mostly from newly 
recognized mbox, and MSOffice)
New mimetypes: mbox, text/calendar, x-sh, vnd.djvu, dbf, and many more
The upgraded copy of icu4j is misidentifying a handful of files as 
UTF-16[LB]E.
We're missing a small amount of text from custom PPT templates (known issue)
We're getting quite a few new exceptions for attachments that weren't 
formerly extracted.  These are unknown embedded objects that are being 
misidentified as PSD, other image files or TTF. 
We're getting quite a few new exceptions for files that are now correctly 
identified as "x-ms-asx" because they contain invalid xml



RE: Tika 1.14?

2016-09-29 Thread Allison, Timothy B.
I didn't find any showstoppers.  Are we ready for Chris to roll 1.14-rc1?


Some notes:
We're getting quite a few new attachments: 315k (mostly from newly recognized 
mbox and MSOffice files).
New mimetypes: mbox, text/calendar, x-sh, vnd.djvu, dbf, and many more.
The upgraded copy of icu4j is misidentifying a handful of files as UTF-16[LB]E.
We're missing a small amount of text from custom PPT templates (known issue).
We're getting quite a few new exceptions for attachments that weren't formerly 
extracted.  These are unknown embedded objects that are being misidentified as 
PSD, other image files, or TTF.
We're getting quite a few new exceptions for files that are now correctly 
identified as "x-ms-asx"; the exceptions occur because the files contain 
invalid XML.
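
For readers who want to see how notes like these can be derived, here is a minimal sketch of a run-to-run comparison. It is not the tooling used to produce the reports in this thread. The record layout, the file names, the mime strings, and the exception name are all invented for illustration.

```python
from collections import Counter

# Hypothetical per-file results from two extraction runs:
# file id -> (detected mime type, exception name or None).
# These records are made up; they are not taken from the real reports.
run_1_13 = {
    "a.bin": ("application/octet-stream", None),
    "b.asx": ("application/octet-stream", None),
    "c.ppt": ("application/vnd.ms-powerpoint", None),
}
run_1_14 = {
    "a.bin": ("application/mbox", None),
    "b.asx": ("application/x-ms-asx", "SAXParseException"),
    "c.ppt": ("application/vnd.ms-powerpoint", None),
}


def compare_runs(old, new):
    """Summarise mime-type changes and newly raised exceptions between runs."""
    mime_changes = Counter()
    new_exceptions = []
    # Only files present in both runs can be compared.
    for file_id in sorted(old.keys() & new.keys()):
        old_mime, old_exc = old[file_id]
        new_mime, new_exc = new[file_id]
        if old_mime != new_mime:
            mime_changes[(old_mime, new_mime)] += 1
        if old_exc is None and new_exc is not None:
            new_exceptions.append((file_id, new_mime, new_exc))
    return mime_changes, new_exceptions


mime_changes, new_exceptions = compare_runs(run_1_13, run_1_14)
for (old_mime, new_mime), count in sorted(mime_changes.items()):
    print(f"{old_mime} -> {new_mime}: {count}")
for file_id, mime, exc in new_exceptions:
    print(f"new exception in {file_id} ({mime}): {exc}")
```

Pairing each new exception with the newly detected mime type is what lets a reviewer separate "we now recognize this file and its content is bad" from a genuine parser regression.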


-----Original Message-----
From: Allison, Timothy B. [mailto:talli...@mitre.org] 
Sent: Wednesday, September 28, 2016 1:34 PM
To: dev@tika.apache.org
Subject: RE: Tika 1.14?

All,
  I finished running the regression tests.  I have just started going through 
the results.

Reports are available here:

https://github.com/tballison/share/blob/master/tika_comparisons/reports_1_14-trunk_vs_1_13.zip




RE: Tika 1.14?

2016-09-28 Thread Allison, Timothy B.
All,
  I finished running the regression tests.  I have just started going through 
the results.

Reports are available here:

https://github.com/tballison/share/blob/master/tika_comparisons/reports_1_14-trunk_vs_1_13.zip



-----Original Message-----
From: Chris Mattmann [mailto:mattm...@apache.org] 
Sent: Thursday, September 22, 2016 12:25 PM
To: dev@tika.apache.org
Subject: Re: Tika 1.14?

Sounds great to me Tim. If you tell me when the tests are done, I’d be happy to 
RC a release!






RE: Tika 1.14?

2016-09-22 Thread Allison, Timothy B.
Thank you, Chris!

-----Original Message-----
From: Chris Mattmann [mailto:mattm...@apache.org] 
Sent: Thursday, September 22, 2016 12:25 PM
To: dev@tika.apache.org
Subject: Re: Tika 1.14?

Sounds great to me Tim. If you tell me when the tests are done, I’d be happy to 
RC a release!






Re: Tika 1.14?

2016-09-22 Thread Chris Mattmann
Sounds great to me Tim. If you tell me when the tests are done, I’d be happy to 
RC a release!





On 9/21/16, 11:31 AM, "Allison, Timothy B."  wrote:

All,
  PDFBox 2.0.3 is now integrated, I'm about to push the integration with 
POI-3.15.  I have a few cleanup things I'd like to take care of.
  Any other items for 1.14?
  Should we aim for Mon 26th for final code changes for 1.14?  I can run 
the regression tests, and then maybe we could cut the release candidate some 
time mid to end of next week?

   Best,

   Tim







RE: Tika 1.14?

2016-09-21 Thread Allison, Timothy B.
All,
  PDFBox 2.0.3 is now integrated, and I'm about to push the integration with 
POI-3.15.  I have a few cleanup things I'd like to take care of.
  Any other items for 1.14?
  Should we aim for Mon 26th for final code changes for 1.14?  I can run the 
regression tests, and then maybe we could cut the release candidate some time 
mid to end of next week?

   Best,

   Tim



RE: Tika 1.14?

2016-09-15 Thread Allison, Timothy B.
Let me touch back in a month. ;)

Looks like PDFBox 2.0.3 and POI-3.15-beta3 or POI-3.15-final will be out 
shortly.

Any blockers/wishes on 1.14?  

-----Original Message-----
From: lewis john mcgibbney [mailto:lewi...@apache.org] 
Sent: Friday, August 12, 2016 7:51 PM
To: dev@tika.apache.org
Subject: Re: Tika 1.14?

Good thread Tim,
Regarding open issues and low hanging fruit to make it into 1.14, I will also 
work on finishing https://github.com/apache/tika/pull/112.
I think Bob has an excellent point. The 2.X work is major and would be a big 
step in the right direction. Having both branches longer and longer is going to 
end up nasty given more time.
I'll try to touch back here in a week or so and see where we are.



Re: Tika 1.14?

2016-08-12 Thread lewis john mcgibbney
Good thread Tim,
Regarding open issues and low-hanging fruit to make it into 1.14, I will
also work on finishing
https://github.com/apache/tika/pull/112.
I think Bob has an excellent point. The 2.X work is major and would be a
big step in the right direction. Maintaining both branches for longer and
longer is going to get nasty.
I'll try to touch back here in a week or so and see where we are.

On Fri, Aug 12, 2016 at 4:24 AM,  wrote:

>
> From: "Allison, Timothy B." 
> To: "dev@tika.apache.org" 
> Cc:
> Date: Thu, 11 Aug 2016 18:59:56 +
> Subject: Tika 1.14?
> All,
>   Any interest in a Tika 1.14 release in a few weeks, say first week of
> September?  I'd like to test and integrate POI 3.15-beta3 which should be
> out fairly soon.  Any other blockers or wishes?
>
> --
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney


Re: Tika 1.14?

2016-08-12 Thread Chris Mattmann
TIKA-1508 and TIKA-1680 are pending my review. I’ll get it done today.



On 8/12/16, 4:24 AM, "Allison, Timothy B."  wrote:

>> I know it's been a little bit since we talked about 2.0.  We had 
>> discussed holding off while some API changes that were under consideration.  
>> Has any progress been made on this?

> I think we're still trying to come up with a plan for how to allow 
> multiple parsers to report text for one

And
> I believe we've also still got the issue of structured metadata 
> outstanding.

Y, I agree on both for 2.0.  Anything else that we need to get into 1.14?  
Sounds like there's a chance we'll be able to get PDFBox 2.0.3 into 1.14...

Thamme, any interest in wrapping up TIKA-1508/TIKA-1680/TIKA-1986 in the 
next few weeks?

Cheers,

  Tim






RE: Tika 1.14?

2016-08-12 Thread Luís Filipe Nassif
I think waiting for PDFBox 2.0.3 would be great. It fixes some regressions.

Regards,
Luis

On Aug 12, 2016 08:24, "Allison, Timothy B." wrote:

> >> I know it's been a little bit since we talked about 2.0.  We had
> >> discussed holding off while some API changes that were under
> >> consideration.  Has any progress been made on this?
>
> > I think we're still trying to come up with a plan for how to allow
> > multiple parsers to report text for one
>
> And
> > I believe we've also still got the issue of structured metadata
> > outstanding.
>
> Y, I agree on both for 2.0.  Anything else that we need to get into 1.14?
> Sounds like there's a chance we'll be able to get PDFBox 2.0.3 into 1.14...
>
> Thamme, any interest in wrapping up TIKA-1508/TIKA-1680/TIKA-1986 in the
> next few weeks?
>
> Cheers,
>
>   Tim
>
>


RE: Tika 1.14?

2016-08-12 Thread Allison, Timothy B.
>> I know it's been a little bit since we talked about 2.0.  We had discussed 
>> holding off while some API changes that were under consideration.  Has any 
>> progress been made on this?

> I think we're still trying to come up with a plan for how to allow multiple 
> parsers to report text for one

And
> I believe we've also still got the issue of structured metadata outstanding.

Y, I agree on both for 2.0.  Anything else that we need to get into 1.14?  
Sounds like there's a chance we'll be able to get PDFBox 2.0.3 into 1.14...

Thamme, any interest in wrapping up TIKA-1508/TIKA-1680/TIKA-1986 in the next 
few weeks?

Cheers,

  Tim



Re: Tika 1.14?

2016-08-12 Thread Ray Gauss
I believe we've also still got the issue of structured metadata outstanding.

Regards,

Ray

> On Aug 12, 2016, at 6:27 AM, Nick Burch  wrote:
> 
> On Thu, 11 Aug 2016, Bob Paulin wrote:
>> I know it's been a little bit since we talked about 2.0.  We had discussed 
>> holding off while some API changes that were under consideration.  Has any 
>> progress been made on this?
> 
> I think we're still trying to come up with a plan for how to allow multiple 
> parsers to report text for one document (either for main parser + fallback 
> parser after error, or for two different kinds of parsers). That's then 
> blocking some of the changes around fallback parsers, multiple parsers etc. 
> Probably a few other API breaks / changes that'll fall out of that too
> 
> How long we want to wait for a solution for that is a different question... 
> (I'm not that keen on saying "if the content handler is a TikaContentHandler 
> with some extra methods, great, otherwise throw an exception for >1 parser", 
> which is about the only one we've come up with so far)
> 
> Nick



Re: Tika 1.14?

2016-08-12 Thread Nick Burch

On Thu, 11 Aug 2016, Bob Paulin wrote:
> I know it's been a little bit since we talked about 2.0.  We had 
> discussed holding off while some API changes that were under 
> consideration.  Has any progress been made on this?


I think we're still trying to come up with a plan for how to allow 
multiple parsers to report text for one document (either for main parser + 
fallback parser after error, or for two different kinds of parsers). 
That's then blocking some of the changes around fallback parsers, multiple 
parsers etc. Probably a few other API breaks / changes that'll fall out of 
that too


How long we want to wait for a solution for that is a different 
question... (I'm not that keen on saying "if the content handler is a 
TikaContentHandler with some extra methods, great, otherwise throw an 
exception for >1 parser", which is about the only one we've come up with 
so far)


Nick
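
To make the trade-off above concrete, here is a minimal sketch of the "handler with extra methods, otherwise throw for >1 parser" idea, with a main parser that fails partway and a fallback parser that then runs. Every name in it (PlainHandler, MultiParserHandler, start_parser, run_parsers, and the two toy parsers) is hypothetical. It does not use or mirror the Tika API.

```python
class PlainHandler:
    """Collects text but has no notion of which parser produced it."""

    def __init__(self):
        self.chunks = []

    def characters(self, text):
        self.chunks.append(text)


class MultiParserHandler(PlainHandler):
    """Adds one extra method so the caller can announce each parser."""

    def __init__(self):
        super().__init__()
        self.by_parser = {}
        self._current = None

    def start_parser(self, name):
        self._current = name
        self.by_parser[name] = []

    def characters(self, text):
        super().characters(text)
        self.by_parser[self._current].append(text)


def run_parsers(data, parsers, handler):
    """Run (name, parse_fn) pairs in order and stop at the first success."""
    multi = isinstance(handler, MultiParserHandler)
    if len(parsers) > 1 and not multi:
        # The option criticised above: refuse rather than mix outputs.
        raise TypeError("more than one parser needs a MultiParserHandler")
    failures = []
    for name, parse in parsers:
        if multi:
            handler.start_parser(name)
        try:
            parse(data, handler)
            break  # first parser that finishes cleanly wins
        except ValueError as exc:
            failures.append((name, str(exc)))
    return failures


def strict_parser(data, handler):
    # Emits partial text, then fails, like a main parser hitting corruption.
    handler.characters(data[:5])
    raise ValueError("corrupt structure after offset 5")


def lenient_parser(data, handler):
    handler.characters(data)


handler = MultiParserHandler()
failures = run_parsers(
    "hello world",
    [("strict", strict_parser), ("lenient", lenient_parser)],
    handler,
)
print(failures)
print(handler.by_parser)
```

The partial text emitted by the failed parser stays in the plain chunk list, which is the ambiguity a handler without the extra method cannot resolve. That is why the sketch rejects the plain handler up front instead of silently mixing the two outputs.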


Re: Tika 1.14?

2016-08-12 Thread Bob Paulin
I know it's been a little bit since we talked about 2.0.  We had 
discussed holding off while some API changes were under 
consideration.  Has any progress been made on this?  The community has 
been really good about maintaining both branches, but how much longer do we 
want to keep this expectation?  Can we consider cutting 2.0 without some of 
these changes and doing them in 3.0, or is that a non-starter?  Other thoughts?


- Bob Paulin

On 8/11/2016 2:29 PM, Mattmann, Chris A (3980) wrote:

Sounds good to me


Re: Tika 1.14?

2016-08-11 Thread Mattmann, Chris A (3980)
Sounds good to me



On 8/11/16, 11:59 AM, "Allison, Timothy B."  wrote:

All,
  Any interest in a Tika 1.14 release in a few weeks, say first week of 
September?  I'd like to test and integrate POI 3.15-beta3 which should be out 
fairly soon.  Any other blockers or wishes?

 Cheers,

Tim