RE: Test document Tika-792

2017-12-05 Thread Allison, Timothy B.
h deleted text. -Original Message----- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Monday, December 04, 2017 10:43 AM To: POI Developers List Subject: RE: Test document Tika-792 I'd prefer to avoid ThreadLocal if possible. Could we add an enum for type of run? Perh

RE: Test document Tika-792

2017-12-04 Thread Allison, Timothy B.
I'd prefer to avoid ThreadLocal if possible. Could we add an enum for type of run? Perhaps use p. 4139-4140 of Ecma ooxml part 1 as the types available? In Tika, we process at the run level; we do not use paragraph's getText(), so I don't think we have any input on deleted text in Paragraph's

RE: classloading xsbs for pptx

2017-11-29 Thread Allison, Timothy B.
You're right. Thank you, Yegor. I swapped out the 3.17-beta1 jars in Solr for the 3.17 jars, and I'm not getting that exception any more. Onward! Cheers, Tim -Original Message- From: Yegor Kozlov [mailto:yegor.koz...@dinom.ru] Sent: Wednesday, November 29, 2017 5:50

RE: classloading xsbs for pptx

2017-11-28 Thread Allison, Timothy B.
at we should avoid leaking to outside libraries--even Tika. On Nov 28, 2017 09:03, "Allison, Timothy B." wrote: All, We have a report that Tika's integration with Solr is now failing proper classloading on a pptx with a CTTable that can't be loaded [1]. The error mess

classloading xsbs for pptx

2017-11-28 Thread Allison, Timothy B.
All, We have a report that Tika's integration with Solr is now failing proper classloading on a pptx with a CTTable that can't be loaded [1]. The error message suggests doing something like this: POIXMLTypeLoader.setClassLoader(CTTable.class.getClassLoader()). Is this the right fix? Should

Running tika-eval on the Rackspace vm

2017-10-23 Thread Allison, Timothy B.
All, If anyone would like to join the fun in running tika-eval on the Rackspace vm, I posted this: https://wiki.apache.org/tika/TikaEvalOnVM . You’ll need access to the vm, of course, but I’m happy to grant that to anyone who wants to chip in and help with regression tests. There are some are

3.17.1?

2017-09-26 Thread Allison, Timothy B.
--- Comment #1 from Javen O'Neal --- Sounds like a 3.17.1 might be in order. +1 What do you all think? -Original Message- From: bugzi...@apache.org [mailto:bugzi...@apache.org] Sent: Tuesday, September 26, 2017 5:23 AM To: dev@poi.apache.org Subject: [Bug 61564] Illegal reflective acc

r1808930 forbidden-api checks and imports

2017-09-21 Thread Allison, Timothy B.
Dominik, Thank you for fixing my new PrintStreams so that the forbidden-api checks would pass...head in hands. I noticed that imports were shortened to wildcards. Should I flip back to listing all? Thank you, again! Best, Tim -import java.io.ByteArrayInputStream; -imp

RE: Apache POI 4.0/Java 8 - new packages?

2017-09-15 Thread Allison, Timothy B.
s. -Original Message----- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Friday, September 15, 2017 7:07 AM To: POI Developers List Subject: RE: Apache POI 4.0/Java 8 Thank you, Dominik!!! So, speaking of 4.0...should we move to semantic versioning: 4.0.0? -Original Message--

RE: Apache POI 4.0/Java 8

2017-09-15 Thread Allison, Timothy B.
Thank you, Dominik!!! So, speaking of 4.0...should we move to semantic versioning: 4.0.0? -Original Message- From: Dominik Stadler [mailto:dominik.stad...@gmx.at] Sent: Thursday, September 14, 2017 1:39 PM To: POI Developers List Subject: Apache POI 4.0/Java 8 Hi, FYI, as 3.17 is out

RE: [VOTE] Apache POI 3.17 release (RC3)

2017-09-11 Thread Allison, Timothy B.
+1 builds on Windows and works in Tika's tests -Original Message- From: Greg Woolsey [mailto:greg.wool...@gmail.com] Sent: Saturday, September 9, 2017 3:34 PM To: POI Developers List Subject: Re: [VOTE] Apache POI 3.17 release (RC3) +1 works-for-me On Sat, Sep 9, 2017, 07:46 Dominik St

SAX v DOM parser for docx

2017-08-31 Thread Allison, Timothy B.
I finally got around to comparing the experimental SAX parser over on Tika with POI/DOM-based parser for docx on the 170k docx files we have. http://162.242.228.174/reports/dom_vs_sax_docx.tar.gz Fewer exceptions...more content. Both are only slight, but overall, this looks promising.

RE: [VOTE] Apache POI 3.17 release (RC2)

2017-08-31 Thread Allison, Timothy B.
Via not in... -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Thursday, August 31, 2017 4:39 PM To: POI Developers List Subject: RE: [VOTE] Apache POI 3.17 release (RC2) My fault...fixed in 61475. All good now: http://162.242.228.174/reports/poi-3.17

RE: [VOTE] Apache POI 3.17 release (RC2)

2017-08-31 Thread Allison, Timothy B.
) Wouldn't shock me to find out it is at the XML level - Word saving the same text in two different ways under the same parent would be completely within my jaded expectations. On Thu, Aug 31, 2017 at 11:37 AM Allison, Timothy B. wrote: > I ran the regression tests against docx, and I'

RE: [VOTE] Apache POI 3.17 release (RC2)

2017-08-31 Thread Allison, Timothy B.
e POI level or the Tika level. Reports are here: http://162.242.228.174/reports/poi-3.17-rc2-docx.tar.gz -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Wednesday, August 30, 2017 8:05 PM To: POI Developers List Subject: RE: [VOTE] Apache POI 3.17 release

RE: [VOTE] Apache POI 3.17 release (RC2)

2017-08-30 Thread Allison, Timothy B.
I’ll run regression tests at least against our .docx tonight to make sure I didn’t wreck anything with 61470.

RE: POI 4.0 and Java 8

2017-08-28 Thread Allison, Timothy B.
Thank you, David! Anyone with contacts at/works for Alfresco? Other stakeholders we should ping? From: David Pilato [mailto:da...@pilato.fr] Sent: Monday, August 28, 2017 10:27 AM To: d...@tika.apache.org; Allison, Timothy B. Cc: POI Developers List Subject: RE: POI 4.0 and Java 8 Nope

RE: POI 4.0 and Java 8

2017-08-28 Thread Allison, Timothy B.
+1 from me. David, any problems with ES if Tika migrates to jdk8? -Original Message- From: Konstantin Gribov [mailto:gros...@gmail.com] Sent: Wednesday, August 23, 2017 1:11 PM To: d...@tika.apache.org Cc: POI Developers List Subject: Re: POI 4.0 and Java 8 Hi, folks. IIRC we moved

RE: Build failed in Jenkins: POI-DSL-Maven #271

2017-07-14 Thread Allison, Timothy B.
Fellow devs, The build started failing before my commits today. I have no doubt that my commits could cause the build to fail :|. My local build worked just fine...not sure what's going on. Best, Tim -Original Message- From: Apache Jenkins Server [mailto:jenk

RE: AddImageBench and org.openjdk.jmh...

2017-07-14 Thread Allison, Timothy B.
. Dominik On Jul 14, 2017 14:01, "Allison, Timothy B." wrote: I'm able to build poi on the commandline, but I'm not able to run unit tests in ooxml in Intellij because of AddImageBench's use of openjdk stuff. Is there a way to work around this? Thank you. Best, Tim

AddImageBench and org.openjdk.jmh...

2017-07-14 Thread Allison, Timothy B.
I'm able to build poi on the commandline, but I'm not able to run unit tests in ooxml in Intellij because of AddImageBench's use of openjdk stuff. Is there a way to work around this? Thank you. Best, Tim

[compress] differences in implementation of Zip ibm vs. oracle?

2017-07-10 Thread Allison, Timothy B.
Compress colleagues, Over on https://bz.apache.org/bugzilla/show_bug.cgi?id=61275, a user submitted two .xlsx files generated with Apache POI, one by IBM's jvm and one by Oracle's jvm. The file generated with Oracle's jvm opens without issue; however, MSOffice complains but can fix the file

RE: FW: Tika content detection and crawled "remote" content

2017-07-05 Thread Allison, Timothy B.
nt: Wednesday, July 5, 2017 8:43 AM To: Allison, Timothy B. Cc: dominik.stad...@gmx.at; POI Developers List (dev@poi.apache.org) Subject: Re: FW: Tika content detection and crawled "remote" content Yes, you'll get few 10,000 more (MS)Office documents thanks to Tika: Ti

FW: Tika content detection and crawled "remote" content

2017-07-05 Thread Allison, Timothy B.
Dominik, Thanks to Sebastian and CommonCrawl, this means that we can now have far better precision and recall in selecting only MSOffice docs for our regression tests!!! -Original Message- From: Sebastian Nagel [mailto:wastl.na...@googlemail.com] Sent: Tuesday, July 4, 2017 6:18 AM T

RE: [RESULT][VOTE] Apache POI 3.17-beta1 release (RC1)

2017-06-28 Thread Allison, Timothy B.
Thank you, Andi, for running the release! On 6/28/17 4:39 AM, Javen O'Neal wrote: > Thanks, Andi. > > Looks like Maven Central has the artifacts now. Other mirrors may not > be up to date, though. > http://search.maven.org/#search%7Cga%7C1%7Cg%3A%22org.apache.poi%22 > > On Jun 27, 2017 16:31, "An

RE: Moving to Git

2017-06-28 Thread Allison, Timothy B.
Y, we did it on Tika...in fact, we also jumped to github directly. It was great. Not sure it dramatically increased contributions, but it does feel modern... -Original Message- From: Greg Woolsey [mailto:greg.wool...@gmail.com] Sent: Wednesday, June 28, 2017 9:37 AM To: POI Developers

RE: [VOTE] Apache POI 3.17-beta1 release (RC1)

2017-06-27 Thread Allison, Timothy B.
+1 Checksums and sig are good. Built/tested on Windows. Thank you, Andi! -Original Message- From: Dominik Stadler [mailto:dominik.stad...@gmx.at] Sent: Monday, June 26, 2017 9:11 AM To: POI Developers List Subject: Re: [VOTE] Apache POI 3.17-beta1 release (RC1) Hi, I compared the

RE: Is it time for POI 3.17-beta1?

2017-06-23 Thread Allison, Timothy B.
Thank you, Dominik! Your reports are so much more easily navigable than mine... I'll take a look at this one next week. This is not a blocker. Caused by: java.lang.ArrayIndexOutOfBoundsException: * at o.a.p.util.LittleEndianCP950Reader.read(LittleEndianCP950Reader.java:77) at o

RE: Is it time for POI 3.17-beta1?

2017-06-23 Thread Allison, Timothy B.
My run just finished as well. http://162.242.228.174/reports/reports_poi-3.17-beta1.zip +1 to roll I get only one new exception (below) in an xlsx file (there are 7 new zlib/gzip in embedded files) and roughly 200 fixed exceptions (mostly wmf) (see: exceptions/new_exceptions_in_B_by_mime.xlsx,

RE: Is it time for POI 3.17-beta1?

2017-06-20 Thread Allison, Timothy B.
me, I will try to kick off a test-run later today, typically runs for > 24h... Dominik. On Mon, Jun 19, 2017 at 7:20 PM, Javen O'Neal wrote: > +1 from me. > > On Jun 19, 2017 04:26, "Allison, Timothy B." wrote: > > > +1 > > > > I'd like to ge

RE: Developing POI with an IDE

2017-06-20 Thread Allison, Timothy B.
I'm on Intellij, and y, first time set up with POI is a pain. What's the best way to share my setup? Lucene-solr has an ant-idea task that copies all .iml files and all files under .idea from a "dev-tools" folder into their proper place.

RE: Is it time for POI 3.17-beta1?

2017-06-19 Thread Allison, Timothy B.
+1 I'd like to get in some small modifications I've been meaning to work on. I'll have time today. -Original Message- From: Andreas Beeker [mailto:kiwiwi...@apache.org] Sent: Monday, June 19, 2017 4:09 AM To: POI Developers List Subject: Is it time for POI 3.17-beta1? Hi *, there ar

RE: POI @ ApacheCon

2017-05-15 Thread Allison, Timothy B.
I'll arrive tonight. See you there! -Original Message- From: David North [mailto:dno...@apache.org] Sent: Monday, May 15, 2017 10:20 AM To: POI Developers List Subject: POI @ ApacheCon Who else is in Miami for ApacheCon? Do we have critical mass for an in-person chat over breakfast/be

RE: xls record length exception

2017-04-27 Thread Allison, Timothy B.
n Is bug 61049 related? On Apr 27, 2017 5:16 AM, "Allison, Timothy B." wrote: > Thank you, Javen. > > As happens too often, I had senders-regret on this email. I found a > triggering file in our regression corpus and opened 61045. > > No multibytes

RE: xls record length exception

2017-04-27 Thread Allison, Timothy B.
ecord contents are sensitive, you could redact all single-byte codepoints with 0x41 ("A"). Of course at that point you've probably found the problem... On Apr 26, 2017 11:45, "Allison, Timothy B." wrote: All, I can't share the file, but... (sorry, it hurts me too). F
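A minimal sketch of the redaction idea above, assuming the caller already knows where the single-byte string payload sits inside the record. The class name `RedactRecord` and the `start`/`length` parameters are hypothetical and are not part of POI.

```java
import java.util.Arrays;

public class RedactRecord {
    /**
     * Returns a copy of the record in which the bytes in
     * [start, start + length) are overwritten with 0x41 ("A").
     * The input array is left untouched and the overall length is preserved,
     * so record-length bookkeeping should stay the same.
     */
    public static byte[] redact(byte[] record, int start, int length) {
        byte[] copy = Arrays.copyOf(record, record.length);
        Arrays.fill(copy, start, start + length, (byte) 0x41);
        return copy;
    }

    public static void main(String[] args) {
        // Hypothetical record: two header bytes followed by a 5-byte string payload.
        byte[] record = {0x05, 0x00, 's', 'e', 'c', 'r', 'e'};
        byte[] redacted = redact(record, 2, 5);
        System.out.println(Arrays.toString(redacted));
    }
}
```

Working on a copy keeps the sensitive original available locally while only the redacted bytes are shared.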

xls record length exception

2017-04-26 Thread Allison, Timothy B.
All, I can't share the file, but... (sorry, it hurts me too). File opens without problem in Excel. If anyone has any recommendations, I'd appreciate it. Caused by: org.apache.poi.hssf.record.RecordFormatException: Expected to find a ContinueRecord in order to read remaining 1 of 51 chars

FW: Tweet by Decalage on Twitter

2017-04-21 Thread Allison, Timothy B.
W00t! Congratulations, Dominik! Decalage (@decalage2) 4/19/17, 5:51 PM Tip

RE: [VOTE] Apache POI 3.16-final release (RC1)

2017-04-12 Thread Allison, Timothy B.
+1 Builds work on Windows (.zip) and Linux (.tar.gz) Cheers, Tim -Original Message- From: Andreas Beeker [mailto:kiwiwi...@apache.org] Sent: Tuesday, April 11, 2017 7:55 PM To: POI Developers List Subject: [VOTE] Apache POI 3.16-final release (RC1) Hi, I've prepared artif

RE: [VOTE] Apache POI 3.16-final release (RC1)

2017-04-12 Thread Allison, Timothy B.
Hmef/quick-contents/quick.html Has \r\n in rc1, but it has \n in trunk. -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Wednesday, April 12, 2017 11:58 AM To: POI Developers List Subject: RE: [VOTE] Apache POI 3.16-final release (RC1) I'm getting
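A minimal sketch of one way to make a content comparison independent of `\r\n` versus `\n`, assuming the expected and actual content are available as strings. The class `LineEndings` is hypothetical, and this is not necessarily the fix that was applied to the HMEF test.

```java
public class LineEndings {
    /** Converts Windows line endings (\r\n) to Unix line endings (\n). */
    public static String normalize(String s) {
        return s.replace("\r\n", "\n");
    }

    /** Compares two strings after normalizing their line endings. */
    public static boolean sameContent(String expected, String actual) {
        return normalize(expected).equals(normalize(actual));
    }

    public static void main(String[] args) {
        String windows = "<html>\r\n<body/>\r\n</html>\r\n";
        String unix = "<html>\n<body/>\n</html>\n";
        // The raw lengths differ by one character per line break.
        System.out.println(windows.length() + " vs " + unix.length());
        System.out.println(sameContent(windows, unix));
    }
}
```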

RE: [VOTE] Apache POI 3.16-final release (RC1)

2017-04-12 Thread Allison, Timothy B.
I'm getting this on Windows w Java 8. Will look into it. Testcase: testAttachmentContents took 0.013 sec FAILED expected:<445> but was:<428> junit.framework.AssertionFailedError: expected:<445> but was:<428> at org.apache.poi.hmef.HMEFTest.assertContents(HMEFTest.java:42)

RE: POI 3.16 Final?

2017-04-11 Thread Allison, Timothy B.
HPSF in that doc. So there's clearly still room to figure out how to map the correct encoding to a given section. But I think we're good for now. Thank you, again, Andi! r1791002 -Original Message----- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Tuesday, April

Re: POI 3.16 Final?

2017-04-11 Thread Allison, Timothy B.
Got it. Will take a look shortly. From: Andreas Beeker Sent: Tuesday, April 11, 2017 3:12:47 AM To: POI Developers List Subject: RE: POI 3.16 Final? > For clarification, when you say property sets...do you mean the Document > Properties from the tablestream? No

RE: POI 3.16 Final?

2017-04-10 Thread Allison, Timothy B.
know if you > need more samples for certain failures, I can fairly easily pull them > from the database if needed. > > Dominik. > > On Mon, Apr 10, 2017 at 4:07 PM, Allison, Timothy B. > > wrote: > >> Thank you, Dominik! >> >> Is there any way I can get the

RE: POI 3.16 Final?

2017-04-10 Thread Allison, Timothy B.
s to one sample file for > each failure in the rightmost column in the table, let me know if you > need more samples for certain failures, I can fairly easily pull them > from the database if needed. > > Dominik. > > On Mon, Apr 10, 2017 at 4:07 PM, Allison, Timothy B. >

RE: bean-free ooxml streaming readers?

2017-04-10 Thread Allison, Timothy B.
>Since it would be read-only, would it just be another option, instead of a >full replacement? Y, think of it like XSSF's eventusermodel. We define an interface for what a user will have to react to, like XSSFSheetXMLHandler's SheetContentsHandler, and we take care of the rest. You can see

RE: POI 3.16 Final?

2017-04-10 Thread Allison, Timothy B.
he.org/ > > bugzilla/show_bug.cgi?id=59268 some time ago, but unfortunately it > > was very unwieldy, i.e. for some reason it was hard to even > > identify the exact version used for previous releases or build the > > latest version cleanly. > > > > Domi

RE: Build failed in Jenkins: POI-DSL-OpenJDK #113

2017-04-04 Thread Allison, Timothy B.
I cleaned up the Big5/CP950Reader, and I'm now getting clean builds with 1.6, 1.7 and 1.8 locally, and I turned back on the test I turned off in the last commit. The 1.6 build on Jenkins is looking promising. 5th time is the charm? Sorry about that. Onward... Cheers, Tim

RE: Build failed in Jenkins: POI-DSL-OpenJDK #113

2017-04-04 Thread Allison, Timothy B.
he maximum value of 7300 junit.framework.AssertionFailedError: Date column width: 7476 is greater than t$ at org.apache.poi.POITestCase.assertBetween(POITestCase.java:248) at org.apache.poi.ss.usermodel.BaseTestSheet.autoSizeDate(BaseTestSheet$ -Original Message----- From: Allison,

RE: Build failed in Jenkins: POI-DSL-OpenJDK #113

2017-04-04 Thread Allison, Timothy B.
All, Sorry about this. I was able to reproduce this failure with Java 7, but not with Java 8. I submitted a fix in r1790130. -Original Message- From: Apache Jenkins Server [mailto:jenk...@builds.apache.org] Sent: Tuesday, April 4, 2017 8:46 AM To: dev@poi.apache.org Subject: Build faile

bean-free ooxml streaming readers?

2017-04-04 Thread Allison, Timothy B.
to even identify the > exact version used for previous releases or build the latest version > cleanly. > > Dominik. > > On Mon, Apr 3, 2017 at 3:14 PM, Allison, Timothy B. > > wrote: > > > Is there anything we can do about ThreadLocal leaks in POI bug

RE: POI 3.16 Final?

2017-04-03 Thread Allison, Timothy B.
Is there anything we can do about ThreadLocal leaks in POI bug 55149/XMLBEANS-502/TIKA-1784? -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Monday, April 3, 2017 9:06 AM To: POI Developers List Subject: RE: POI 3.16 Final? +1 for next couple days-ish

RE: POI 3.16 Final?

2017-04-03 Thread Allison, Timothy B.
+1 for next couple days-ish. I'd like to finish or abandon hope on 50955. I'll be working on that this morning. -Original Message- From: Andreas Beeker [mailto:kiwiwi...@apache.org] Sent: Saturday, April 1, 2017 6:46 PM To: POI Developers List Subject: POI 3.16 Final? Hi, how about

RE: Build failed in Jenkins: POI-DSL-1.6 #212

2017-03-16 Thread Allison, Timothy B.
So that I don't repeat this or similar: ant clean test test-integration Anything else? Fix coming shortly... -Original Message- From: Timothy Allison [mailto:tball...@yahoo.com.INVALID] Sent: Thursday, March 16, 2017 4:36 PM To: dev@poi.apache.org Subject: Re: Build failed in Jenkins: PO

FW: ApacheCon: Tomorrow's Software, Today. Schedule announced!

2017-03-09 Thread Allison, Timothy B.
Nick Burch, David North and Bob Paulin, congratulations! My proposal is in the "backup queue", but I look forward to catching up with you all. If anyone wants to chat tika-eval or related, let me know. Cheers, Tim -Original Message- From: Rich Bowen [mailto:rbo...@apache.org] S

RE: xlsb Streaming Reader?

2017-03-06 Thread Allison, Timothy B.
on how much time you have to contribute in the name of file format support. On Mar 6, 2017 6:00 AM, "Allison, Timothy B." wrote: > All, > I'm considering starting work on a streaming reader for xlsb. Has > anyone else worked on this? Would this con

xlsb Streaming Reader?

2017-03-06 Thread Allison, Timothy B.
All, I'm considering starting work on a streaming reader for xlsb. Has anyone else worked on this? Would this conflict with anyone's plans? Best, Tim [1] https://issues.apache.org/jira/browse/TIKA-1195 --

tika-eval

2017-02-17 Thread Allison, Timothy B.
All, I finally got around to adding tika-eval[1] to Apache Tika. If you have any interest in comparing the output of different tools/versions/parameters on text extraction, give it a try. You don't need to use Tika or format the output in a specific format; plain UTF-8 text will work. Ti

RE: [VOTE] Apache POI 3.16 beta 2 release (RC1)

2017-02-03 Thread Allison, Timothy B.
ge so > that this works as before again. > > Dominik. > > On Wed, Feb 1, 2017 at 10:24 PM, Allison, Timothy B. > > wrote: > > > Thank you, Dominik! > > > > Looks like there is one other new one -- very rare -- that i

RE: [VOTE] Apache POI 3.16 beta 2 release (RC1)

2017-02-01 Thread Allison, Timothy B.
Thank you, Dominik! Looks like there is one other new one -- very rare -- that isn't caused by the embedded extractor or the WMF component. java.lang.NegativeArraySizeException at o.a.p.ddf.EscherComplexProperty.resizeComplexData(EscherComplexProperty.java:102) at o.a.p.ddf.Esc

RE: [VOTE] Apache POI 3.16 beta 2 release (RC1)

2017-02-01 Thread Allison, Timothy B.
> On Jan 31, 2017 17:49, "Allison, Timothy B." wrote: > > Argh...sorry, no time this go around... > > -Original Message- > From: Javen O'Neal [mailto:one...@apache.org] > Sent: Tuesday, January 31, 2017 8:19 PM > To: POI Developers List > Subject: Re: [

RE: [VOTE] Apache POI 3.16 beta 2 release (RC1)

2017-01-31 Thread Allison, Timothy B.
Argh...sorry, no time this go around... -Original Message- From: Javen O'Neal [mailto:one...@apache.org] Sent: Tuesday, January 31, 2017 8:19 PM To: POI Developers List Subject: Re: [VOTE] Apache POI 3.16 beta 2 release (RC1) Should I wait for the results of common crawl or other regres

RE: EMF/WMF files from our regression corpus

2017-01-20 Thread Allison, Timothy B.
common objects/constants ... although I had a quick view > over the EMF code, I haven't read the spec to say anything about that. > As the WMF/EMF implementations look similar, I would simply progress > to a more complete parser and eventually generalize the code. > > Than

RE: EMF/WMF files from our regression corpus

2017-01-19 Thread Allison, Timothy B.
That's what I thought, but I figured sharing might be helpful. Andi, If you have any time to review my initial commit for the wmf parser, I'd appreciate it. The good parts I lifted from your emf code...the other parts...well...sorry. :) There may be some areas where the emf parser could lev

EMF/WMF files from our regression corpus

2017-01-18 Thread Allison, Timothy B.
All, I recently extracted ~9000 emf and ~37000 wmf files from our regression corpus: http://162.242.228.174/embedded_files/xmfs.tar.bz2 (301MB). Enjoy! Quite helpful already in working with EMF parser. Cheers, Tim

RE: Using Apache Commons IO

2017-01-17 Thread Allison, Timothy B.
Y, I agree with Nick. I'm slightly inclined to not using Commons IO to avoid potential conflicts, but I defer to the more active devs :). We can't do the equivalent of a maven-shade-plugin in Ant, can we? Looks like maybe in gradle...but... -Original Message- From: Nick Burch [mailto

RE: [Bug 60519] Extractor for *SSF embeddings

2017-01-05 Thread Allison, Timothy B.
Thank you Andi and Javen! Javen, I respect your point about "not limited to MSOffice documents". My selfish/Tika-ish goal in processing them, frankly, is only to extract embedded documents and their metadata. Andi's patch demonstrated the need to handle the "feature" distinction btwn how M

RE: [Bug 60519] Extractor for *SSF embeddings

2017-01-04 Thread Allison, Timothy B.
Andi, I like what you've done with the patch for this issue. All, Is it worthwhile adding a rudimentary EMF parser to POI? It might help us explore what other "full docs" are stuffed inside EMF like the PDFs that you found. I hacked out a version for Tika (locally), but I think this woul

RE: got docx?

2016-12-12 Thread Allison, Timothy B.
-----Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Monday, December 12, 2016 9:58 AM To: POI Developers List Cc: d...@tika.apache.org Subject: RE: got docx? To close the loop and share my gratitude publicly... Thank you, Dominik, for transferring 41k, 5GB of do

RE: got docx?

2016-12-12 Thread Allison, Timothy B.
To close the loop and share my gratitude publicly... Thank you, Dominik, for transferring 41k, 5GB of docx/dotx to our regression corpus! I’ve already found a number of “areas for improvement” in Tika's experimental docx SAX parser, and a few areas for improvement in POI's XWPFDocument/DOM par

RE: got docx?

2016-12-06 Thread Allison, Timothy B.
urely share them via the VM, transferring them from the local hard disk will take a bit as I don't have an fast/unlimited line at home, but I should be able to put them onto a directory on the VM over the next few days. Dominik. On Tue, Dec 6, 2016 at 2:51 AM, Allison, Timothy B. wrote:

got docx?

2016-12-05 Thread Allison, Timothy B.
Dominik, Any chance you'd be willing to share your docx/docm? I'm working on an alternate SAX parser, and the more docs for testing, the better. Cheers, Tim

RE: [Bug 60329] Avoid NPE when styleid is null

2016-12-01 Thread Allison, Timothy B.
Thank you, Mark! -Original Message- From: bugzi...@apache.org [mailto:bugzi...@apache.org] Sent: Wednesday, November 30, 2016 9:23 PM To: dev@poi.apache.org Subject: [Bug 60329] Avoid NPE when styleid is null https://bz.apache.org/bugzilla/show_bug.cgi?id=60329 --- Comment #10 from Mark

FW: ApacheCon Miami is coming in May.

2016-11-30 Thread Allison, Timothy B.
> ApacheCon and Apache Big Data will be held at the Intercontinental in Miami, > Florida, May 16-18, 2017 I plan to attend. Who's in? Any interest in collaborating on a talk or submitting your own? Cheers, Tim -Original Message- From: Rich Bowen [mailto:rbo...@apache.org]

RE: 2006 ML format?

2016-11-23 Thread Allison, Timothy B.
o opinion on the read only parser. -Original Message----- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Wednesday, November 23, 2016 2:38 PM To: POI Developers List Subject: RE: 2006 ML format? All, I went it alone for the 2006ml format on Tika, see details [1]. If you have any fe

RE: 2006 ML format?

2016-11-23 Thread Allison, Timothy B.
e for the regular docx? Cheers, Tim [1] https://issues.apache.org/jira/browse/TIKA-2179?focusedCommentId=15691150&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15691150 -Original Message----- From: Allison, Timothy B. [mailto:talli...@mitre.o

RE: 2006 ML format?

2016-11-21 Thread Allison, Timothy B.
Subject: Re: 2006 ML format? Wow, this is nothing like what I thought it would be. I discovered that you can write a document in this format by selecting save as xml document. On Fri, Nov 18, 2016 at 7:03 AM, Allison, Timothy B. wrote: > Thank you, Javen. I worry that I'll be adding duct ta

RE: 2006 ML format?

2016-11-18 Thread Allison, Timothy B.
, "Allison, Timothy B." wrote: All, On TIKA-2179 [1], Sean Story submitted a document that appears to be a 2006 ML format .xml file. It appears to inline the components of a regular docx into a single xml file, no zip. Is it worth the effort to build a read-only subclass of OPCPackage

2006 ML format?

2016-11-17 Thread Allison, Timothy B.
All, On TIKA-2179 [1], Sean Story submitted a document that appears to be a 2006 ML format .xml file. It appears to inline the components of a regular docx into a single xml file, no zip. Is it worth the effort to build a read-only subclass of OPCPackage (say, InlinePackage) that would paral

RE: [VOTE] Apache POI 3.16-beta1 release (RC1)

2016-11-16 Thread Allison, Timothy B.
+1 Apologies for delay. Finished running comparisons against ~800k files. Quick look interpretation: more attachments, fewer exceptions (esp in visio), no new exceptions, better content, esp in macros. http://162.242.228.174/reports/reports_3_16-rc1.zip -Original Message- From: Greg W

RE: POI 3.16 beta 1 soon?

2016-11-10 Thread Allison, Timothy B.
> Tim, any urgent changes needed for Tika? My apologies to Andi; I haven't gotten around to testing his patch on 60345. I trust Yegor's +1 on that. It would be great if we could get that one in. I had hoped to rough out XWPFGlossaryDocument, but that isn't going to happen any time soon. Than

RE: zip exceptions in objects embedded in HSLF

2016-11-07 Thread Allison, Timothy B.
ICT files. I'm not sure how much is the effort to fix it, but for now can you swallow all errors for the "image/pict" content type? It is a known troublemaker and the best you can do for now is to catch all its exceptions. Yegor On Fri, Nov 4, 2016 at 9:25 PM, Allison, Timot

RE: zip exceptions in objects embedded in HSLF

2016-11-04 Thread Allison, Timothy B.
And for a larger collection of zip exceptions in embedded HSLF, see TIKA-2164. -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Friday, November 4, 2016 11:49 AM To: POI Users List Subject: zip exceptions in objects embedded in HSLF POI Colleagues, On

RE: [Bug 60329] New: Avoid NPE when styleid is null

2016-11-02 Thread Allison, Timothy B.
Sorry. Should have moved this back to the ticket... -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Wednesday, November 2, 2016 7:06 AM To: POI Developers List Subject: RE: [Bug 60329] New: Avoid NPE when styleid is null Sounds good. Do we want to auto

RE: [Bug 60329] New: Avoid NPE when styleid is null

2016-11-02 Thread Allison, Timothy B.
Sounds good. Do we want to auto-generate a styleid and risk collision with an actual styleid (chances would be very, very low) or handle the null? Thank you. -Original Message- From: Mark Murphy [mailto:jmarkmur...@gmail.com] Sent: Tuesday, November 1, 2016 5:25 PM To: POI Developers L

RE: Build failed in Jenkins: POI #1586

2016-10-19 Thread Allison, Timothy B.
Y, sorry... Mea culpa. Thank you for fixing it! -Original Message- From: Javen O'Neal [mailto:one...@apache.org] Sent: Tuesday, October 18, 2016 3:06 PM To: POI Developers List Subject: Re: Build failed in Jenkins: POI #1586 Use Charset.forName("UTF-8") for Java 6 compatibility. https
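A minimal sketch of the Java 6 compatible approach suggested in the quoted reply. `Charset.forName("UTF-8")` is available on Java 6, whereas `StandardCharsets.UTF_8` was only added in Java 7. The class name `Utf8Compat` and its helper methods are hypothetical.

```java
import java.nio.charset.Charset;

public class Utf8Compat {
    // Looked up once and reused; every Java platform is required to support UTF-8.
    public static final Charset UTF8 = Charset.forName("UTF-8");

    /** Encodes a string as UTF-8 without relying on the platform default charset. */
    public static byte[] encode(String s) {
        return s.getBytes(UTF8);
    }

    /** Decodes UTF-8 bytes back into a string. */
    public static String decode(byte[] bytes) {
        return new String(bytes, UTF8);
    }

    public static void main(String[] args) {
        byte[] bytes = encode("caf\u00e9");
        System.out.println(bytes.length + " bytes, round trip: " + decode(bytes));
    }
}
```

Passing a `Charset` rather than a charset name also avoids the checked `UnsupportedEncodingException` thrown by `String.getBytes(String)`.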

govdocs1 docs in our test suite

2016-10-18 Thread Allison, Timothy B.
All, In my VBAMacroReader blitz, I added two docs -- one unmodified and one modified -- that derive from govdocs1: Bug60273 and Bug59830. Let me know if I should remove them. Best, Tim

Apache Tika's public regression corpus

2016-10-05 Thread Allison, Timothy B.
All, I recently blogged about some of the work we're doing with a large scale regression corpus to make Tika, POI and PDFBox more robust and to identify regressions before release. If you'd like to chip in with recommendations, requests or Hadoop/Spark clusters (why not shoot for the stars), p

RE: [VOTE] Apache POI 3.15 RC3

2016-09-19 Thread Allison, Timothy B.
+1 Built on Windows and integrated successfully with Tika and Tika's unit tests. I'm relying on my earlier run of RC2 on the full regression corpus as evidence that we're in good shape. Thank you, David! Cheers, Tim -Original Message- From: David North [mailto:dno...@apa

FW: [jira] [Commented] (TIKA-2058) Memory Leak in Tika version 1.13 when parsing millions of files

2016-09-16 Thread Allison, Timothy B.
W00t! Thank you, Luis! -Original Message- From: Tim Barrett (JIRA) [mailto:j...@apache.org] Sent: Friday, September 16, 2016 9:11 AM To: talli...@apache.org Subject: [jira] [Commented] (TIKA-2058) Memory Leak in Tika version 1.13 when parsing millions of files [ https://issues.ap

RE: [VOTE] Apache POI 3.15 (RC2)

2016-09-15 Thread Allison, Timothy B.
With the exception of the ppt master slide text, the content looks consistent across the POI file types. Found two bugs in my eval code, but no regressions that would hold up the release :)

RE: [VOTE] Apache POI 3.15 (RC2)

2016-09-15 Thread Allison, Timothy B.
e special handling during close() anyway. I have some minor modifications/simplifications around these statements that I will apply post-release Dominik. On Thu, Sep 15, 2016 at 2:01 PM, Allison, Timothy B. wrote: > >* I'll take a look at the patch on TIKA-2058, if it's low-ris

RE: [VOTE] Apache POI 3.15 (RC2)

2016-09-15 Thread Allison, Timothy B.
I'm finally taking a look at the new exceptions from the reports. Most of the new exceptions seem to be on attachments within ppt. We recently changed how we handle embedded ole-wrapped attachments in ppt, and we're discovering new embedded file types. [1] Next...to look at the content report.

RE: [VOTE] Apache POI 3.15 (RC2)

2016-09-15 Thread Allison, Timothy B.
>* I'll take a look at the patch on TIKA-2058, if it's low-risk it can go in I committed Luis Filipe Nassif's patch last night (BUG 60140). Please do take a look to make sure the change doesn't cause any unforeseen problems. >> * I could do with input from those who use HSLF about whether to hol

RE: [VOTE] Apache POI 3.15-beta3

2016-09-14 Thread Allison, Timothy B.
List Subject: Re: [VOTE] Apache POI 3.15-beta3 Bug 60003 is still open and is a regression if POI should be extracting Prague from the test slideshow. https://bz.apache.org/bugzilla/show_bug.cgi?id=60003 On Fri, Sep 9, 2016 at 11:44 AM, Allison, Timothy B. wrote: > Thank you, Dominik, for

RE: [VOTE] Apache POI 3.15 (RC2)

2016-09-14 Thread Allison, Timothy B.
Regression results are here. I haven't had a chance to look. This compares Tika's trunk with poi 3.15-rc1 (? I think?) against 3.15-beta1 in Tika 1.13. Some differences might be changes at the Tika level. I ran this against the full corpus so there are file formats we don't care about. http

RE: [VOTE] Apache POI 3.15 (RC2)

2016-09-14 Thread Allison, Timothy B.
All, On TIKA-2058 [1], Luis Filipe Nassif attached a patch for POI that _may_ solve a memory leak. We haven't had a chance to test that it solves the problem. The patch looks reasonable to me (it is very short), but I don't know enough about FileBackedDataSource to apply it responsibly. If

potential memory issue in FileBackedDataSource (TIKA-2058)

2016-09-12 Thread Allison, Timothy B.
All, On TIKA-2058, Tim Barrett reported some OOM problems and posted an hprof of the issue. Luis Filipe Nassif analyzed the hprof and identified POI's FileBackedDataSource as a potential source of the problem. We haven't yet determined if this is a single-triggering file type of OOM or a st

RE: [VOTE] Apache POI 3.15 (RC2)

2016-09-12 Thread Allison, Timothy B.
Kicked off regression tests. Should have results by tomorrow. -Original Message- From: David North [mailto:dno...@apache.org] Sent: Sunday, September 11, 2016 3:47 PM To: POI Developers List Subject: [VOTE] Apache POI 3.15 (RC2) Hi everyone, My apologies for going AWOL in the middle o

RE: [VOTE] Apache POI 3.15-beta3

2016-09-09 Thread Allison, Timothy B.
it can > > be included as part of POI's unit test suite? > > > > On Fri, Aug 12, 2016 at 6:52 PM, Javen O'Neal > wrote: > >> On Aug 12, 2016 11:39, "Allison, Timothy B." > wrote: > >>>...the two potential content regressions may be caused

RE: [VOTE] Apache POI 3.15 (RC1)

2016-08-29 Thread Allison, Timothy B.
I won't have time to make any progress (or even to respond to you...sorry!) on 60044 until the end of the week... :( -Original Message- From: Javen O'Neal [mailto:one...@apache.org] Sent: Sunday, August 28, 2016 12:32 PM To: POI Developers List Subject: Re: [VOTE] Apache POI 3.15 (RC1)
