RE: Test document Tika-792

2017-12-05 Thread Allison, Timothy B.
w HWPF works with deleted text. -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Monday, December 04, 2017 10:43 AM To: POI Developers List <dev@poi.apache.org> Subject: RE: Test document Tika-792 I'd prefer to avoid ThreadLocal if possible. Could we

RE: Test document Tika-792

2017-12-04 Thread Allison, Timothy B.
I'd prefer to avoid ThreadLocal if possible. Could we add an enum for type of run? Perhaps use p. 4139-4140 of Ecma ooxml part 1 as the types available? In Tika, we process at the run level; we do not use paragraph's getText(), so I don't think we have any input on deleted text in Paragraph's

RE: classloading xsbs for pptx

2017-11-29 Thread Allison, Timothy B.
You're right. Thank you, Yegor. I swapped out the 3.17-beta1 jars in Solr for the 3.17 jars, and I'm not getting that exception any more. Onward! Cheers, Tim -Original Message- From: Yegor Kozlov [mailto:yegor.koz...@dinom.ru] Sent: Wednesday, November 29, 2017 5:50

RE: classloading xsbs for pptx

2017-11-28 Thread Allison, Timothy B.
il of POI that we should avoid leaking to outside libraries--even Tika. On Nov 28, 2017 09:03, "Allison, Timothy B." <talli...@mitre.org> wrote: All, We have a report that Tika's integration with Solr is now failing proper classloading on a pptx with a CTTable that can't be load

classloading xsbs for pptx

2017-11-28 Thread Allison, Timothy B.
All, We have a report that Tika's integration with Solr is now failing proper classloading on a pptx with a CTTable that can't be loaded [1]. The error message suggests doing something like this: POIXMLTypeLoader.setClassLoader(CTTable.class.getClassLoader()). Is this the right fix? Should

Running tika-eval on the Rackspace vm

2017-10-23 Thread Allison, Timothy B.
All, If anyone would like to join the fun in running tika-eval on the Rackspace vm, I posted this: https://wiki.apache.org/tika/TikaEvalOnVM . You’ll need access to the vm, of course, but I’m happy to grant that to anyone who wants to chip in and help with regression tests. There are some

3.17.1?

2017-09-26 Thread Allison, Timothy B.
--- Comment #1 from Javen O'Neal --- Sounds like a 3.17.1 might be in order. +1 What do you all think? -Original Message- From: bugzi...@apache.org [mailto:bugzi...@apache.org] Sent: Tuesday, September 26, 2017 5:23 AM To: dev@poi.apache.org Subject: [Bug 61564]

r1808930 forbidden-api checks and imports

2017-09-21 Thread Allison, Timothy B.
Dominik, Thank you for fixing my new PrintStreams so that the forbidden-api checks would pass...head in hands. I noticed that imports were shortened to wildcards. Should I flip back to listing all? Thank you, again! Best, Tim -import java.io.ByteArrayInputStream;

RE: Apache POI 4.0/Java 8 - new packages?

2017-09-15 Thread Allison, Timothy B.
ckages. -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Friday, September 15, 2017 7:07 AM To: POI Developers List <dev@poi.apache.org> Subject: RE: Apache POI 4.0/Java 8 Thank you, Dominik!!! So, speaking of 4.0...should we move to semantic versi

RE: Apache POI 4.0/Java 8

2017-09-15 Thread Allison, Timothy B.
Thank you, Dominik!!! So, speaking of 4.0...should we move to semantic versioning: 4.0.0? -Original Message- From: Dominik Stadler [mailto:dominik.stad...@gmx.at] Sent: Thursday, September 14, 2017 1:39 PM To: POI Developers List Subject: Apache POI 4.0/Java 8 Hi,

RE: [VOTE] Apache POI 3.17 release (RC3)

2017-09-11 Thread Allison, Timothy B.
+1 builds on Windows and works in Tika's tests -Original Message- From: Greg Woolsey [mailto:greg.wool...@gmail.com] Sent: Saturday, September 9, 2017 3:34 PM To: POI Developers List Subject: Re: [VOTE] Apache POI 3.17 release (RC3) +1 works-for-me On Sat, Sep 9,

SAX v DOM parser for docx

2017-08-31 Thread Allison, Timothy B.
I finally got around to comparing the experimental SAX parser over on Tika with POI/DOM-based parser for docx on the 170k docx files we have. http://162.242.228.174/reports/dom_vs_sax_docx.tar.gz Fewer exceptions...more content. Both are only slight, but overall, this looks promising.

RE: [VOTE] Apache POI 3.17 release (RC2)

2017-08-31 Thread Allison, Timothy B.
Via not in... -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Thursday, August 31, 2017 4:39 PM To: POI Developers List <dev@poi.apache.org> Subject: RE: [VOTE] Apache POI 3.17 release (RC2) My fault...fixed in 61475. All good now: http://162.242.2

RE: [VOTE] Apache POI 3.17 release (RC2)

2017-08-31 Thread Allison, Timothy B.
he POI 3.17 release (RC2) Wouldn't shock me to find out it is at the XML level - Word saving the same text in two different ways under the same parent would be completely within my jaded expectations. On Thu, Aug 31, 2017 at 11:37 AM Allison, Timothy B. <talli...@mitre.org> wrote: > I ran

RE: [VOTE] Apache POI 3.17 release (RC2)

2017-08-31 Thread Allison, Timothy B.
level or the Tika level. Reports are here: http://162.242.228.174/reports/poi-3.17-rc2-docx.tar.gz -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Wednesday, August 30, 2017 8:05 PM To: POI Developers List <dev@poi.apache.org> Subject: RE: [VOTE]

RE: [VOTE] Apache POI 3.17 release (RC2)

2017-08-30 Thread Allison, Timothy B.
I’ll run regression tests at least against our .docx tonight to make sure I didn’t wreck anything with 61470.

RE: POI 4.0 and Java 8

2017-08-28 Thread Allison, Timothy B.
Thank you, David! Anyone with contacts at/works for Alfresco? Other stakeholders we should ping? From: David Pilato [mailto:da...@pilato.fr] Sent: Monday, August 28, 2017 10:27 AM To: d...@tika.apache.org; Allison, Timothy B. <talli...@mitre.org> Cc: POI Developers List <dev@poi.a

RE: POI 4.0 and Java 8

2017-08-28 Thread Allison, Timothy B.
+1 from me. David, any problems with ES if Tika migrates to jdk8? -Original Message- From: Konstantin Gribov [mailto:gros...@gmail.com] Sent: Wednesday, August 23, 2017 1:11 PM To: d...@tika.apache.org Cc: POI Developers List Subject: Re: POI 4.0 and Java 8 Hi,

RE: Build failed in Jenkins: POI-DSL-Maven #271

2017-07-14 Thread Allison, Timothy B.
Fellow devs, The build started failing before my commits today. I have no doubt that my commits could cause the build to fail :|. My local build worked just fine...not sure what's going on. Best, Tim -Original Message- From: Apache Jenkins Server

RE: AddImageBench and org.openjdk.jmh...

2017-07-14 Thread Allison, Timothy B.
roject/module settings. Dominik On Jul 14, 2017 14:01, "Allison, Timothy B." <talli...@mitre.org> wrote: I'm able to build poi on the commandline, but I'm not able to run unit tests in ooxml in Intellij because of AddImageBench's use of openjdk stuff. Is there a way to work a

AddImageBench and org.openjdk.jmh...

2017-07-14 Thread Allison, Timothy B.
I'm able to build poi on the commandline, but I'm not able to run unit tests in ooxml in Intellij because of AddImageBench's use of openjdk stuff. Is there a way to work around this? Thank you. Best, Tim

[compress] differences in implementation of Zip ibm vs. oracle?

2017-07-10 Thread Allison, Timothy B.
Compress colleagues, Over on https://bz.apache.org/bugzilla/show_bug.cgi?id=61275, a user submitted two .xlsx files generated with Apache POI, one by IBM's jvm and one by Oracle's jvm. The file generated with Oracle's jvm opens without issue; however, MSOffice complains but can fix the file

RE: FW: Tika content detection and crawled "remote" content

2017-07-05 Thread Allison, Timothy B.
day, July 5, 2017 8:43 AM To: Allison, Timothy B. <talli...@mitre.org> Cc: dominik.stad...@gmx.at; POI Developers List (dev@poi.apache.org) <dev@poi.apache.org> Subject: Re: FW: Tika content detection and crawled "remote" content Yes, you'll get few 10,000 more (MS)

FW: Tika content detection and crawled "remote" content

2017-07-05 Thread Allison, Timothy B.
Dominik, Thanks to Sebastian and CommonCrawl, this means that we can now have far better precision and recall in selecting only MSOffice docs for our regression tests!!! -Original Message- From: Sebastian Nagel [mailto:wastl.na...@googlemail.com] Sent: Tuesday, July 4, 2017 6:18 AM

RE: [RESULT][VOTE] Apache POI 3.17-beta1 release (RC1)

2017-06-28 Thread Allison, Timothy B.
Thank you, Andi, for running the release! On 6/28/17 4:39 AM, Javen O'Neal wrote: > Thanks, Andi. > > Looks like Maven Central has the artifacts now. Other mirrors may not > be up to date, though. > http://search.maven.org/#search%7Cga%7C1%7Cg%3A%22org.apache.poi%22 > > On Jun 27, 2017 16:31,

RE: Moving to Git

2017-06-28 Thread Allison, Timothy B.
Y, we did it on Tika...in fact, we also jumped to github directly. It was great. Not sure it dramatically increased contributions, but it does feel modern... -Original Message- From: Greg Woolsey [mailto:greg.wool...@gmail.com] Sent: Wednesday, June 28, 2017 9:37 AM To: POI

RE: [VOTE] Apache POI 3.17-beta1 release (RC1)

2017-06-27 Thread Allison, Timothy B.
+1 Checksums and sig are good. Built/tested on Windows. Thank you, Andi! -Original Message- From: Dominik Stadler [mailto:dominik.stad...@gmx.at] Sent: Monday, June 26, 2017 9:11 AM To: POI Developers List Subject: Re: [VOTE] Apache POI 3.17-beta1 release (RC1)

RE: Is it time for POI 3.17-beta1?

2017-06-23 Thread Allison, Timothy B.
Thank you, Dominik! Your reports are so much more easily navigable than mine... I'll take a look at this one next week. This is not a blocker. Caused by: java.lang.ArrayIndexOutOfBoundsException: * at o.a.p.util.LittleEndianCP950Reader.read(LittleEndianCP950Reader.java:77) at

RE: Is it time for POI 3.17-beta1?

2017-06-23 Thread Allison, Timothy B.
My run just finished as well. http://162.242.228.174/reports/reports_poi-3.17-beta1.zip +1 to roll I get only one new exception (below) in an xlsx file (there are 7 new zlib/gzip in embedded files) and roughly 200 fixed exceptions (mostly wmf) (see:

RE: Is it time for POI 3.17-beta1?

2017-06-20 Thread Allison, Timothy B.
I 3.17-beta1? Ok with me, I will try to kick off a test-run later today, typically runs for > 24h... Dominik. On Mon, Jun 19, 2017 at 7:20 PM, Javen O'Neal <one...@apache.org> wrote: > +1 from me. > > On Jun 19, 2017 04:26, "Allison, Timothy B." <talli...@mitre.

RE: Developing POI with an IDE

2017-06-20 Thread Allison, Timothy B.
I'm on Intellij, and y, first time set up with POI is a pain. What's the best way to share my setup? Lucene-solr has an ant-idea task that copies all .iml files and all files under .idea from a "dev-tools" folder into their proper place.

RE: Is it time for POI 3.17-beta1?

2017-06-19 Thread Allison, Timothy B.
+1 I'd like to get in some small modifications I've been meaning to work on. I'll have time today. -Original Message- From: Andreas Beeker [mailto:kiwiwi...@apache.org] Sent: Monday, June 19, 2017 4:09 AM To: POI Developers List Subject: Is it time for POI

RE: POI @ ApacheCon

2017-05-15 Thread Allison, Timothy B.
I'll arrive tonight. See you there! -Original Message- From: David North [mailto:dno...@apache.org] Sent: Monday, May 15, 2017 10:20 AM To: POI Developers List Subject: POI @ ApacheCon Who else is in Miami for ApacheCon? Do we have critical mass for an in-person

RE: xls record length exception

2017-04-27 Thread Allison, Timothy B.
length exception Is bug 61049 related? On Apr 27, 2017 5:16 AM, "Allison, Timothy B." <talli...@mitre.org> wrote: > Thank you, Javen. > > As happens too often, I had senders-regret on this email. I found a > triggering file in our regression corpus and opened 61045. >

RE: xls record length exception

2017-04-27 Thread Allison, Timothy B.
record? If the record contents are sensitive, you could redact all single-byte codepoints with 0x41 ("A"). Of course at that point you've probably found the problem... On Apr 26, 2017 11:45, "Allison, Timothy B." <talli...@mitre.org> wrote: All, I can't share the file,
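A minimal sketch of the redaction idea quoted above, assuming the sensitive character data sits in a known byte range of the record and that every character in that range is stored as a single byte. The class name `RecordRedactor` and the `start`/`end` offsets are hypothetical and not part of POI; finding the right offsets in a real record is left to the reader.

```java
import java.util.Arrays;

public class RecordRedactor {
    /**
     * Returns a copy of the record with bytes in [start, end) overwritten by 0x41 ("A").
     * The length is unchanged, so any record-length fields outside the range stay valid.
     */
    public static byte[] redact(byte[] record, int start, int end) {
        byte[] copy = Arrays.copyOf(record, record.length);
        Arrays.fill(copy, start, end, (byte) 0x41);
        return copy;
    }

    public static void main(String[] args) {
        byte[] original = {0x01, 0x53, 0x45, 0x43, 0x02};
        // Redact the three middle bytes; the first and last bytes are left alone.
        System.out.println(Arrays.toString(redact(original, 1, 4)));
    }
}
```

Because the replacement keeps every byte position intact, a parser that fails on the original file should, under these assumptions, fail in the same place on the redacted copy.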

xls record length exception

2017-04-26 Thread Allison, Timothy B.
All, I can't share the file, but... (sorry, it hurts me too). File opens without problem in Excel. If anyone has any recommendations, I'd appreciate it. Caused by: org.apache.poi.hssf.record.RecordFormatException: Expected to find a ContinueRecord in order to read remaining 1 of 51 chars

FW: Tweet by Decalage on Twitter

2017-04-21 Thread Allison, Timothy B.
W00t! Congratulations, Dominik! [https://pbs.twimg.com/profile_images/2828800838/bda75c2c7281c026a24def1348b6c022_normal.png] Decalage (@decalage2) 4/19/17, 5:51 PM Tip:

RE: [VOTE] Apache POI 3.16-final release (RC1)

2017-04-12 Thread Allison, Timothy B.
+1 Builds work on Windows (.zip) and Linux (.tar.gz) Cheers, Tim -Original Message- From: Andreas Beeker [mailto:kiwiwi...@apache.org] Sent: Tuesday, April 11, 2017 7:55 PM To: POI Developers List Subject: [VOTE] Apache POI 3.16-final release (RC1) Hi,

RE: [VOTE] Apache POI 3.16-final release (RC1)

2017-04-12 Thread Allison, Timothy B.
Hmef/quick-contents/quick.html Has \r\n in rc1, but it has \n in trunk. -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Wednesday, April 12, 2017 11:58 AM To: POI Developers List <dev@poi.apache.org> Subject: RE: [VOTE] Apache POI 3.16-final releas

RE: [VOTE] Apache POI 3.16-final release (RC1)

2017-04-12 Thread Allison, Timothy B.
I'm getting this on Windows w Java 8. Will look into it. Testcase: testAttachmentContents took 0.013 sec FAILED expected:<445> but was:<428> junit.framework.AssertionFailedError: expected:<445> but was:<428> at org.apache.poi.hmef.HMEFTest.assertContents(HMEFTest.java:42)

RE: POI 3.16 Final?

2017-04-11 Thread Allison, Timothy B.
-- there is no HPSF in that doc. So there's clearly still room to figure out how to map the correct encoding to a given section. But I think we're good for now. Thank you, again, Andi! r1791002 -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Tuesday, April 11, 2017 6

Re: POI 3.16 Final?

2017-04-11 Thread Allison, Timothy B.
Got it. Will take a look shortly. From: Andreas Beeker Sent: Tuesday, April 11, 2017 3:12:47 AM To: POI Developers List Subject: RE: POI 3.16 Final? > For clarification, when you say property sets...do you mean the Document > Properties

RE: POI 3.16 Final?

2017-04-10 Thread Allison, Timothy B.
now if you > need more samples for certain failures, I can fairly easily pull them > from the database if needed. > > Dominik. > > On Mon, Apr 10, 2017 at 4:07 PM, Allison, Timothy B. > <talli...@mitre.org> > wrote: > >> Thank you, Dominik! >> >>

RE: POI 3.16 Final?

2017-04-10 Thread Allison, Timothy B.
o include links to one sample file for > each failure in the rightmost column in the table, let me know if you > need more samples for certain failures, I can fairly easily pull them > from the database if needed. > > Dominik. > > On Mon, Apr 10, 2017 at 4:07 PM, Allison, Timoth

RE: bean-free ooxml streaming readers?

2017-04-10 Thread Allison, Timothy B.
>Since it would be read-only, would it just be another option, instead of a >full replacement? Y, think of it like XSSF's eventusermodel. We define an interface for what a user will have to react to, like XSSFSheetXMLHandler's SheetContentsHandler, and we take care of the rest. You can see
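The event-model idea described above can be sketched roughly as follows. This is a hypothetical illustration only: `ParagraphHandler`, `read` and `extractText` are invented names, not POI or Tika APIs, and the toy driver walks in-memory lists where a real reader would consume SAX events from the docx XML.

```java
import java.util.Arrays;
import java.util.List;

public class StreamingSketch {
    /** Hypothetical callback interface: the events a user reacts to. */
    public interface ParagraphHandler {
        void startParagraph();
        void run(String text);
        void endParagraph();
    }

    /** Toy driver standing in for the streaming reader; it fires one callback per event. */
    public static void read(List<List<String>> paragraphs, ParagraphHandler handler) {
        for (List<String> runs : paragraphs) {
            handler.startParagraph();
            for (String text : runs) {
                handler.run(text);
            }
            handler.endParagraph();
        }
    }

    /** Example consumer: concatenates run text, one line per paragraph. */
    public static String extractText(List<List<String>> paragraphs) {
        final StringBuilder sb = new StringBuilder();
        read(paragraphs, new ParagraphHandler() {
            @Override public void startParagraph() { }
            @Override public void run(String text) { sb.append(text); }
            @Override public void endParagraph() { sb.append('\n'); }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        List<List<String>> doc = Arrays.asList(
                Arrays.asList("Hello, ", "world"),
                Arrays.asList("Bye"));
        System.out.print(extractText(doc));
    }
}
```

The point of the design is that the caller never holds a full document model in memory; it only sees the callbacks, much as `SheetContentsHandler` does for sheets.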

RE: POI 3.16 Final?

2017-04-10 Thread Allison, Timothy B.
y work on > > https://bz.apache.org/ > > bugzilla/show_bug.cgi?id=59268 some time ago, but unfortunately it > > was very unwieldy, i.e. for some reason it was hard to even > > identify the exact version used for previous releases or build the > > latest version clean

RE: Build failed in Jenkins: POI-DSL-OpenJDK #113

2017-04-04 Thread Allison, Timothy B.
I cleaned up the Big5/CP950Reader, and I'm now getting clean builds with 1.6, 1.7 and 1.8 locally, and I turned back on the test I turned off in the last commit. The 1.6 build on Jenkins is looking promising. 5th time is the charm? Sorry about that. Onward... Cheers, Tim

RE: Build failed in Jenkins: POI-DSL-OpenJDK #113

2017-04-04 Thread Allison, Timothy B.
value of 7300 junit.framework.AssertionFailedError: Date column width: 7476 is greater than t$ at org.apache.poi.POITestCase.assertBetween(POITestCase.java:248) at org.apache.poi.ss.usermodel.BaseTestSheet.autoSizeDate(BaseTestSheet$ -Original Message- From: Allison, Timothy B

RE: Build failed in Jenkins: POI-DSL-OpenJDK #113

2017-04-04 Thread Allison, Timothy B.
All, Sorry about this. I was able to reproduce this failure with Java 7, but not with Java 8. I submitted a fix in r1790130. -Original Message- From: Apache Jenkins Server [mailto:jenk...@builds.apache.org] Sent: Tuesday, April 4, 2017 8:46 AM To: dev@poi.apache.org Subject: Build

bean-free ooxml streaming readers?

2017-04-04 Thread Allison, Timothy B.
even identify the > exact version used for previous releases or build the latest version > cleanly. > > Dominik. > > On Mon, Apr 3, 2017 at 3:14 PM, Allison, Timothy B. > <talli...@mitre.org> > wrote: > > > Is there anything we can do about ThreadLocal leak

RE: POI 3.16 Final?

2017-04-03 Thread Allison, Timothy B.
Is there anything we can do about ThreadLocal leaks in POI bug 55149/XMLBEANS-502/TIKA-1784? -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Monday, April 3, 2017 9:06 AM To: POI Developers List <dev@poi.apache.org> Subject: RE: POI 3.16 Fina

RE: POI 3.16 Final?

2017-04-03 Thread Allison, Timothy B.
+1 for next couple days-ish. I'd like to finish or abandon hope on 50955. I'll be working on that this morning. -Original Message- From: Andreas Beeker [mailto:kiwiwi...@apache.org] Sent: Saturday, April 1, 2017 6:46 PM To: POI Developers List Subject: POI 3.16

RE: Build failed in Jenkins: POI-DSL-1.6 #212

2017-03-16 Thread Allison, Timothy B.
So that I don't repeat this or similar: ant clean test test-integration Anything else? Fix coming shortly... -Original Message- From: Timothy Allison [mailto:tball...@yahoo.com.INVALID] Sent: Thursday, March 16, 2017 4:36 PM To: dev@poi.apache.org Subject: Re: Build failed in Jenkins:

FW: ApacheCon: Tomorrow's Software, Today. Schedule announced!

2017-03-09 Thread Allison, Timothy B.
Nick Burch, David North and Bob Paulin, congratulations! My proposal is in the "backup queue", but I look forward to catching up with you all. If anyone wants to chat tika-eval or related, let me know. Cheers, Tim -Original Message- From: Rich Bowen [mailto:rbo...@apache.org]

RE: xlsb Streaming Reader?

2017-03-06 Thread Allison, Timothy B.
ending on how much time you have to contribute in the name of file format support. On Mar 6, 2017 6:00 AM, "Allison, Timothy B." <talli...@mitre.org> wrote: > All, > I'm considering starting work on a streaming reader for xlsb. Has > anyone else worke

xlsb Streaming Reader?

2017-03-06 Thread Allison, Timothy B.
All, I'm considering starting work on a streaming reader for xlsb. Has anyone else worked on this? Would this conflict with anyone's plans? Best, Tim [1] https://issues.apache.org/jira/browse/TIKA-1195

tika-eval

2017-02-17 Thread Allison, Timothy B.
All, I finally got around to adding tika-eval[1] to Apache Tika. If you have any interest in comparing the output of different tools/versions/parameters on text extraction, give it a try. You don't need to use Tika or format the output in a specific format; plain UTF-8 text will work.

RE: [VOTE] Apache POI 3.16 beta 2 release (RC1)

2017-02-03 Thread Allison, Timothy B.
e beta2-release which reverts some parts of this change so > that this works as before again. > > Dominik. > > On Wed, Feb 1, 2017 at 10:24 PM, Allison, Timothy B. > <talli...@mitre.org> > wrote: > > > Thank you, Dominik! > > > > Looks like there

RE: [VOTE] Apache POI 3.16 beta 2 release (RC1)

2017-02-01 Thread Allison, Timothy B.
Thank you, Dominik! Looks like there is one other new one -- very rare -- that isn't caused by the embedded extractor or the WMF component. java.lang.NegativeArraySizeException at o.a.p.ddf.EscherComplexProperty.resizeComplexData(EscherComplexProperty.java:102) at

RE: [VOTE] Apache POI 3.16 beta 2 release (RC1)

2017-02-01 Thread Allison, Timothy B.
ll release it in a few hours. > > On Jan 31, 2017 17:49, "Allison, Timothy B." <talli...@mitre.org> wrote: > > Argh...sorry, no time this go around... > > -Original Message- > From: Javen O'Neal [mailto:one...@apache.org] > Sent: Tuesday, January 31, 2

RE: [VOTE] Apache POI 3.16 beta 2 release (RC1)

2017-01-31 Thread Allison, Timothy B.
Argh...sorry, no time this go around... -Original Message- From: Javen O'Neal [mailto:one...@apache.org] Sent: Tuesday, January 31, 2017 8:19 PM To: POI Developers List Subject: Re: [VOTE] Apache POI 3.16 beta 2 release (RC1) Should I wait for the results of common

RE: EMF/WMF files from our regression corpus

2017-01-20 Thread Allison, Timothy B.
ly generalize the code. > > Thanks for your work. > > Andi > > > On 19.01.2017 17:30, Allison, Timothy B. wrote: > > That's what I thought, but I figured sharing might be helpful. > > > > Andi, > > If you have any time to review my initial commit for the

RE: EMF/WMF files from our regression corpus

2017-01-19 Thread Allison, Timothy B.
That's what I thought, but I figured sharing might be helpful. Andi, If you have any time to review my initial commit for the wmf parser, I'd appreciate it. The good parts I lifted from your emf code...the other parts...well...sorry. :) There may be some areas where the emf parser could

EMF/WMF files from our regression corpus

2017-01-18 Thread Allison, Timothy B.
All, I recently extracted ~9000 emf and ~37000 wmf files from our regression corpus: http://162.242.228.174/embedded_files/xmfs.tar.bz2 (301MB). Enjoy! Quite helpful already in working with EMF parser. Cheers, Tim

RE: Using Apache Commons IO

2017-01-17 Thread Allison, Timothy B.
Y, I agree with Nick. I'm slightly inclined to not using Commons IO to avoid potential conflicts, but I defer to the more active devs :). We can't do the equivalent of a maven-shade-plugin in Ant, can we? Looks like maybe in gradle...but... -Original Message- From: Nick Burch

RE: [Bug 60519] Extractor for *SSF embeddings

2017-01-05 Thread Allison, Timothy B.
Thank you Andi and Javen! Javen, I respect your point about "not limited to MSOffice documents". My selfish/Tika-ish goal in processing them, frankly, is only to extract embedded documents and their metadata. Andi's patch demonstrated the need to handle the "feature" distinction btwn how

RE: [Bug 60519] Extractor for *SSF embeddings

2017-01-04 Thread Allison, Timothy B.
Andi, I like what you've done with the patch for this issue. All, Is it worthwhile adding a rudimentary EMF parser to POI? It might help us explore what other "full docs" are stuffed inside EMF like the PDFs that you found. I hacked out a version for Tika (locally), but I think this

RE: got docx?

2016-12-12 Thread Allison, Timothy B.
es. -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Monday, December 12, 2016 9:58 AM To: POI Developers List <dev@poi.apache.org> Cc: d...@tika.apache.org Subject: RE: got docx? To close the loop and share my gratitude publicly... Thank you, Dominik, f

RE: got docx?

2016-12-12 Thread Allison, Timothy B.
To close the loop and share my gratitude publicly... Thank you, Dominik, for transferring 41k, 5GB of docx/dotx to our regression corpus! I’ve already found a number of “areas for improvement” in Tika's experimental docx SAX parser, and a few areas for improvement in POI's XWPFDocument/DOM

RE: got docx?

2016-12-06 Thread Allison, Timothy B.
corpus documents? I can surely share them via the VM, transferring them from the local hard disk will take a bit as I don't have an fast/unlimited line at home, but I should be able to put them onto a directory on the VM over the next few days. Dominik. On Tue, Dec 6, 2016 at 2:51 AM, Allison, Tim

got docx?

2016-12-05 Thread Allison, Timothy B.
Dominik, Any chance you'd be willing to share your docx/docm? I'm working on an alternate SAX parser, and the more docs for testing, the better. Cheers, Tim

RE: [Bug 60329] Avoid NPE when styleid is null

2016-12-01 Thread Allison, Timothy B.
Thank you, Mark! -Original Message- From: bugzi...@apache.org [mailto:bugzi...@apache.org] Sent: Wednesday, November 30, 2016 9:23 PM To: dev@poi.apache.org Subject: [Bug 60329] Avoid NPE when styleid is null https://bz.apache.org/bugzilla/show_bug.cgi?id=60329 --- Comment #10 from

RE: 2006 ML format?

2016-11-23 Thread Allison, Timothy B.
e no opinion on the read only parser. -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Wednesday, November 23, 2016 2:38 PM To: POI Developers List <dev@poi.apache.org> Subject: RE: 2006 ML format? All, I went it alone for the 2006ml format on Ti

RE: 2006 ML format?

2016-11-23 Thread Allison, Timothy B.
for the regular docx? Cheers, Tim [1] https://issues.apache.org/jira/browse/TIKA-2179?focusedCommentId=15691150&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15691150 -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Monday

RE: 2006 ML format?

2016-11-21 Thread Allison, Timothy B.
oi.apache.org> Subject: Re: 2006 ML format? Wow, this is nothing like what I thought it would be. I discovered that you can write a document in this format by selecting save as xml document. On Fri, Nov 18, 2016 at 7:03 AM, Allison, Timothy B. <talli...@mitre.org> wrote: > Thank you,

RE: 2006 ML format?

2016-11-18 Thread Allison, Timothy B.
, 2016 10:55 AM, "Allison, Timothy B." <talli...@mitre.org> wrote: All, On TIKA-2179 [1], Sean Story submitted a document that appears to be a 2006 ML format .xml file. It appears to inline the components of a regular docx into a single xml file, no zip. Is it worth the effort

2006 ML format?

2016-11-17 Thread Allison, Timothy B.
All, On TIKA-2179 [1], Sean Story submitted a document that appears to be a 2006 ML format .xml file. It appears to inline the components of a regular docx into a single xml file, no zip. Is it worth the effort to build a read-only subclass of OPCPackage (say, InlinePackage) that would

RE: [VOTE] Apache POI 3.16-beta1 release (RC1)

2016-11-16 Thread Allison, Timothy B.
+1 Apologies for delay. Finished running comparisons against ~800k files. Quick look interpretation: more attachments, fewer exceptions (esp in visio), no new exceptions, better content, esp in macros. http://162.242.228.174/reports/reports_3_16-rc1.zip -Original Message- From: Greg

RE: POI 3.16 beta 1 soon?

2016-11-10 Thread Allison, Timothy B.
> Tim, any urgent changes needed for Tika? My apologies to Andi; I haven't gotten around to testing his patch on 60345. I trust Yegor's +1 on that. It would be great if we could get that one in. I had hoped to rough out XWPFGlossaryDocument, but that isn't going to happen any time soon.

RE: zip exceptions in objects embedded in HSLF

2016-11-07 Thread Allison, Timothy B.
ases when reading PICT files. I'm not sure how much is the effort to fix it, but for now can you swallow all errors for the "image/pict" content type? It is a known troublemaker and the best you can do for now is to catch all its exceptions. Yegor On Fri, Nov 4, 2016 at

RE: zip exceptions in objects embedded in HSLF

2016-11-04 Thread Allison, Timothy B.
And for a larger collection of zip exceptions in embedded HSLF, see TIKA-2164. -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Friday, November 4, 2016 11:49 AM To: POI Users List <u...@poi.apache.org> Subject: zip exceptions in objects embedded i

RE: [Bug 60329] New: Avoid NPE when styleid is null

2016-11-02 Thread Allison, Timothy B.
Sorry. Should have moved this back to the ticket... -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Wednesday, November 2, 2016 7:06 AM To: POI Developers List <dev@poi.apache.org> Subject: RE: [Bug 60329] New: Avoid NPE when styleid is null Sound

RE: [Bug 60329] New: Avoid NPE when styleid is null

2016-11-02 Thread Allison, Timothy B.
Sounds good. Do we want to auto-generate a styleid and risk collision with an actual styleid (chances would be very, very low) or handle the null? Thank you. -Original Message- From: Mark Murphy [mailto:jmarkmur...@gmail.com] Sent: Tuesday, November 1, 2016 5:25 PM To: POI Developers

RE: Build failed in Jenkins: POI #1586

2016-10-19 Thread Allison, Timothy B.
Y, sorry... Mea culpa. Thank you for fixing it! -Original Message- From: Javen O'Neal [mailto:one...@apache.org] Sent: Tuesday, October 18, 2016 3:06 PM To: POI Developers List Subject: Re: Build failed in Jenkins: POI #1586 Use Charset.forName("UTF-8") for Java 6
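A minimal sketch of the Java 6-compatible approach suggested in the quoted reply, assuming the build broke because `java.nio.charset.StandardCharsets` (added in Java 7) was used. The class name `Utf8Java6` is hypothetical; `Charset.forName`, `String.getBytes(Charset)` and `new String(byte[], Charset)` are all available on Java 6.

```java
import java.nio.charset.Charset;

public class Utf8Java6 {
    // Looked up once by name; avoids StandardCharsets.UTF_8, which needs Java 7+.
    private static final Charset UTF_8 = Charset.forName("UTF-8");

    /** Encodes a string as UTF-8 bytes without depending on the platform default charset. */
    public static byte[] encode(String s) {
        return s.getBytes(UTF_8);
    }

    /** Decodes UTF-8 bytes back into a string. */
    public static String decode(byte[] bytes) {
        return new String(bytes, UTF_8);
    }

    public static void main(String[] args) {
        byte[] bytes = encode("caf\u00e9");
        // "e with acute accent" takes two bytes in UTF-8, so the array holds five bytes.
        System.out.println(bytes.length);
        System.out.println(decode(bytes).equals("caf\u00e9"));
    }
}
```

Passing a `Charset` rather than the string name `"UTF-8"` to `getBytes` also avoids the checked `UnsupportedEncodingException`.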

govdocs1 docs in our test suite

2016-10-18 Thread Allison, Timothy B.
All, In my VBAMacroReader blitz, I added two docs -- one unmodified and one modified -- that derive from govdocs1: Bug60273 and Bug59830. Let me know if I should remove them. Best, Tim

Apache Tika's public regression corpus

2016-10-05 Thread Allison, Timothy B.
All, I recently blogged about some of the work we're doing with a large scale regression corpus to make Tika, POI and PDFBox more robust and to identify regressions before release. If you'd like to chip in with recommendations, requests or Hadoop/Spark clusters (why not shoot for the stars),

RE: [VOTE] Apache POI 3.15 (RC2)

2016-09-15 Thread Allison, Timothy B.
With the exception of the ppt master slide text, the content looks consistent across the POI file types. Found two bugs in my eval code, but no regressions that would hold up the release :)

RE: [VOTE] Apache POI 3.15 (RC2)

2016-09-15 Thread Allison, Timothy B.
if we do not need the special handling during close() anyway. I have some minor modifications/simplifications around these statements that I will apply post-release Dominik. On Thu, Sep 15, 2016 at 2:01 PM, Allison, Timothy B. <talli...@mitre.org> wrote: > >* I'll take a look at the

RE: [VOTE] Apache POI 3.15 (RC2)

2016-09-15 Thread Allison, Timothy B.
I'm finally taking a look at the new exceptions from the reports. Most of the new exceptions seem to be on attachments within ppt. We recently changed how we handle embedded ole-wrapped attachments in ppt, and we're discovering new embedded file types. [1] Next...to look at the content

RE: [VOTE] Apache POI 3.15 (RC2)

2016-09-15 Thread Allison, Timothy B.
>* I'll take a look at the patch on TIKA-2058, if it's low-risk it can go in I committed Luis Filipe Nassif's patch last night (BUG 60140). Please do take a look to make sure the change doesn't cause any unforeseen problems. >> * I could do with input from those who use HSLF about whether to

RE: [VOTE] Apache POI 3.15-beta3

2016-09-14 Thread Allison, Timothy B.
<dev@poi.apache.org> Subject: Re: [VOTE] Apache POI 3.15-beta3 Bug 60003 is still open and is a regression if POI should be extracting Prague from the test slideshow. https://bz.apache.org/bugzilla/show_bug.cgi?id=60003 On Fri, Sep 9, 2016 at 11:44 AM, Allison, Timothy B. <talli...@mitre.o

RE: [VOTE] Apache POI 3.15 (RC2)

2016-09-14 Thread Allison, Timothy B.
Regression results are here. I haven't had a chance to look. This compares Tika's trunk with poi 3.15-rc1 (? I think?) against 3.15-beta1 in Tika 1.13. Some differences might be changes at the Tika level. I ran this against the full corpus so there are file formats we don't care about.

RE: [VOTE] Apache POI 3.15 (RC2)

2016-09-14 Thread Allison, Timothy B.
All, On TIKA-2058 [1], Luis Filipe Nassif attached a patch for POI that _may_ solve a memory leak. We haven't had a chance to test that it solves the problem. The patch looks reasonable to me (it is very short), but I don't know enough about FileBackedDataSource to apply it responsibly.

potential memory issue in FileBackedDataSource (TIKA-2058)

2016-09-12 Thread Allison, Timothy B.
All, On TIKA-2058, Tim Barrett reported some OOM problems and posted an hprof of the issue. Luis Filipe Nassif analyzed the hprof and identified POI's FileBackedDataSource as a potential source of the problem. We haven't yet determined if this is a single-triggering file type of OOM or a

RE: [VOTE] Apache POI 3.15 (RC2)

2016-09-12 Thread Allison, Timothy B.
Kicked off regression tests. Should have results by tomorrow. -Original Message- From: David North [mailto:dno...@apache.org] Sent: Sunday, September 11, 2016 3:47 PM To: POI Developers List Subject: [VOTE] Apache POI 3.15 (RC2) Hi everyone, My apologies for going

RE: [VOTE] Apache POI 3.15-beta3

2016-09-09 Thread Allison, Timothy B.
pptx file shareable and ASL 2.0 licensed so that it can > > be included as part of POI's unit test suite? > > > > On Fri, Aug 12, 2016 at 6:52 PM, Javen O'Neal <javenon...@gmail.com> > wrote: > >> On Aug 12, 2016 11:39, "Allison, Timothy B." <talli...

RE: [VOTE] Apache POI 3.15 (RC1)

2016-08-29 Thread Allison, Timothy B.
I won't have time to make any progress (or even to respond to you...sorry!) on 60044 until the end of the week... :( -Original Message- From: Javen O'Neal [mailto:one...@apache.org] Sent: Sunday, August 28, 2016 12:32 PM To: POI Developers List Subject: Re: [VOTE]

RE: [VOTE] Apache POI 3.15-beta3

2016-08-11 Thread Allison, Timothy B.
Not -1 worthy, but should we include the maven subdirectory in the src release so that people can easily run the maven-poms task? From: Dominik Stadler [mailto:dominik.stad...@gmx.at] Sent: Thursday, August 11, 2016 8:19 AM To: POI Developers List Subject: Re: [VOTE] Apache

office dissector

2016-07-01 Thread Allison, Timothy B.
All, I recently came across: https://github.com/grierforensics/officedissector . We've added their test docs (esp. those from Fraunhofer Fokus (http://www.document-interoperability.com/)) to our regression corpus on Tika. Might be of interest. Cheers, Tim

RE: [VOTE] Apache POI 3.15-beta2 release (RC1)

2016-07-01 Thread Allison, Timothy B.
And, I should have added, thank you, Andi, for running the release!!! -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Thursday, June 30, 2016 10:46 AM To: POI Developers List <dev@poi.apache.org> Subject: RE: [VOTE] Apache POI 3.15-beta2 releas
