All,
testUrlOnly in MimeDetectionTest makes a URL.openStream() call. I have to
modify Tika's pom with my proxy info to get the test to pass in my
environment. Would the test still test TIKA-327 if we modified the test to
read from a local copy of nheri's html? It looks to me like the
Doh! Please ignore last email: https://issues.apache.org/jira/browse/TIKA-1129
Would anyone mind if I recreated the structure from the offending html so that
we can return this test to test a local copy of the document?
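A rough sketch of the local-copy idea, in case it helps the discussion: a file: URL still goes through URL.openStream(), so a test could keep exercising the URL code path without touching the network or a proxy. This is not the actual MimeDetectionTest code; the class name, the method names and the stand-in HTML are all made up for illustration, and no Tika classes are involved.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical sketch, not part of Tika.
class LocalUrlStreamSketch {

    // Reads up to maxBytes through URL.openStream(), the same call a test
    // would make against a remote URL.
    static byte[] readPrefix(URL url, int maxBytes) throws IOException {
        byte[] buf = new byte[maxBytes];
        int total = 0;
        try (InputStream in = url.openStream()) {
            while (total < maxBytes) {
                int n = in.read(buf, total, maxBytes - total);
                if (n < 0) {
                    break; // end of stream before maxBytes were read
                }
                total += n;
            }
        }
        return Arrays.copyOf(buf, total);
    }

    // Writes a stand-in document to a temp file and reads it back via a file: URL.
    static String demo() {
        try {
            Path local = Files.createTempFile("local-copy", ".html");
            try {
                Files.write(local,
                        "<html><body>stand-in</body></html>".getBytes(StandardCharsets.UTF_8));
                URL url = local.toUri().toURL(); // file: scheme, no network access
                return new String(readPrefix(url, 6), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(local);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In a real test the temp file would presumably be replaced by a document under the test resources directory, with the resulting URL passed to the detector.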
-Original Message-
From: Allison, Timothy B.
Sent: Friday, June
I think I may be uniquely qualified to answer this from an Idiot's guide/newish
to Tika perspective. :) Apologies if I'm missing out on more obvious answers!
SVN info:
http://tika.apache.org/source-repository.html
Generally how to contribute (Lucene has a good description):
-repository.html site?
Thank you.
Best,
Tim
-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Thursday, July 11, 2013 10:53 AM
To: dev@tika.apache.org
Subject: RE: MagicDetector don't work for all RFC882 message Types.
I think I may be uniquely
Wow. Thank you, all! I very much look forward to working with you in these
new roles.
Best,
Tim
From: Michael McCandless [luc...@mikemccandless.com]
Sent: Tuesday, July 30, 2013 6:29 AM
To: dev@tika.apache.org
Subject: Re: [Announce]
The MITRE Corporation
-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Tuesday, July 30, 2013 8:55 AM
To: dev@tika.apache.org
Subject: RE: [Announce] Welcome Tim Allison as Tika PM member and committer
Wow. Thank you, all! I very much look forward to working
All,
I don't appear to have permissions to close out issues that I didn't open
(TIKA-1001 and TIKA-1153). Is this standard jira policy or user error? Thank
you.
Best,
Tim
-Original Message-
From: Tim Allison (JIRA) [mailto:j...@apache.org]
Sent: Wednesday,
All,
Is there an easy way to build Tika from scratch without reliance on
1.5-SNAPSHOT in the mvn repository and without building the components in the
correct order and then manually loading them into a local mvn repository?
At the main level, I've been using a simple 'mvn package'
Thank
: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Thursday, September 19, 2013 2:20 PM
To: dev@tika.apache.org
Subject: building tika from scratch without pulling 1.5-SNAPSHOT from the
repository?
All,
Is there an easy way to build Tika from scratch without reliance on
1.5-SNAPSHOT
Does the speedup only help if you are trying to parse an individual page vs the
entire document? If so, is partial parsing a use case for Tika? If this has
the same performance on the full document as the regular parser, does it have
lower memory overhead?
-Original Message-
From:
All,
How do we fix the Tika build in Jenkins?
The polling log shows an IOException when trying to install maven
(https://builds.apache.org/job/Tika-trunk/scmPollLog/). Permissions or space
issue?
The last successful tika-app-1.5 SNAPSHOT
Speaking of building...is there an easy way to build Tika locally without
reference to the repositories and without building each component one by one
(in the correct order) and then manually installing in a local repository?
Thank you!
Downloading:
Feb 2014, Allison, Timothy B. wrote:
Speaking of building...is there an easy way to build Tika locally
without reference to the repositories and without building each
component one by one (in the correct order) and then manually installing
in a local repository?
I just do:
cd root of tika
I haven't seen the problem, but that's my test. Will take a look.
-Original Message-
From: Nick Burch [mailto:n...@apache.org]
Sent: Wednesday, February 19, 2014 9:44 AM
To: dev@tika.apache.org
Subject: Failing test - PDFParserTest.testSequentialParser
I've just tried to build Tika
Y. Sorry about that. Changing to 15 now.
-Original Message-
From: Nick Burch [mailto:apa...@gagravarr.org]
Sent: Wednesday, February 19, 2014 10:10 AM
To: dev@tika.apache.org
Subject: RE: Failing test - PDFParserTest.testSequentialParser
On Wed, 19 Feb 2014, Allison, Timothy B. wrote
handle this; NonSequentialParser is currently not reading the header version)
Cheers,
Tim
-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Wednesday, February 19, 2014 10:18 AM
To: dev@tika.apache.org
Subject: RE: Failing test
Failure here too. My last successful pull and build occurred Feb 19.
-Original Message-
From: Nick Burch [mailto:apa...@gagravarr.org]
Sent: Tuesday, February 25, 2014 8:14 PM
To: dev@tika.apache.org
Subject: Re: Build failure at trunk in
org.apache.tika.server.UnpackerResourceTest
On
He's alive!!!
My bet is:
TIKA-1243 - Upgrade to Commons Compress 1.7, and add a disabled unit test for
7z support. 7z support is not enabled yet, pending a commons compress fix
When I changed trunk back to 1.5, all tests pass.
Change in MD5 implementation btwn Compress 1.5 and 1.7?
Sorry, should have been clearer: changed the pom in trunk to pull Compress 1.5.
-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Wednesday, February 26, 2014 11:35 AM
To: dev@tika.apache.org
Subject: RE: Build failed in Jenkins: Tika-trunk #1062
He's alive
Hong-Thai,
Thank you for running these tests. I suspect (mea culpa) that the increase
in PDF runtime exception failures was caused by PDFBOX-1803/TIKA-1233, which
was not fixed before 1.5 was cut.
I recently made major modifications to the metadata extraction components of
the PDFParser
I've been having problems with co and updating over the last few days.
-Original Message-
From: Hong-Thai Nguyen [mailto:hong-thai.ngu...@polyspot.com]
Sent: Thursday, April 03, 2014 11:37 AM
To: dev@tika.apache.org
Subject: Unable to commit SVN ?
Hi Tika men,
I have 500 error when
+1 Please, yes. Thank you!
-Original Message-
From: Lewis John Mcgibbney [mailto:lewis.mcgibb...@gmail.com]
Sent: Wednesday, May 14, 2014 11:21 AM
To: dev@tika.apache.org
Subject: [DISCUSS] Nightly Jenkins Builds for Trunk
Hi Folks,
Right now in Jenkins (builds.apache.org) we don't
Thank you, Nick. Will open trivial issue and fix.
-Original Message-
From: Nick Burch [mailto:apa...@gagravarr.org]
Sent: Tuesday, May 20, 2014 5:27 PM
To: dev@tika.apache.org
Subject: Re: Property type closed choice
On Tue, 20 May 2014, Allison, Timothy B. wrote:
When I run
Welcome, Tyler!
I found Jukka's how-to dev Tika in Eclipse very useful (don't know if you are
an Eclipser, though):
http://lucene.472066.n3.nabble.com/Newb-IDE-Maven-tp3389963p3390012.html
As with many projects, some of the most useful documentation is in the test
cases, head to the test
All,
Nick recommended I put the question to the dev list for discussion. It might
be useful to centralize our json handling of Metadata. We are currently
using different libraries and doing different things in CLI and in tika-server.
1) Do we want to centralize json handling of
Ok, should work as of r1601444. Thank you, Nick, for working through this
issue with me.
-Original Message-
From: Nick Burch [mailto:apa...@gagravarr.org]
Sent: Monday, June 09, 2014 10:45 AM
To: dev@tika.apache.org
Subject: Re: Timezone issue with TTF parser?
On Mon, 9 Jun 2014, Ken
All,
In working on adding the stacktrace from a parse exception to the server
response, I'm trying to find the most jax-rsly elegant way of handling
exceptions. There seems to be a bit of duplicated code, some with good reason,
for exception handling. Is TikaExceptionMapper actually used
, sorry for a delay,
I can see it is expected to process a checked exception, so unless we
have one of the root resources throwing it from one of the methods then
it is not used
Thanks, Sergey
On 19/06/14 02:22, Allison, Timothy B. wrote:
All,
In working on adding the stacktrace from a parse
Thought this might be of interest.
-Original Message-
From: John Hewson [mailto:j...@jahewson.com]
Sent: Friday, June 27, 2014 2:58 AM
To: DImuthu Upeksha
Cc: d...@pdfbox.apache.org
Subject: Re: Improving OCR plugin for PDFBox
Hi Dimuthu
That's great. We should wait until closer to the
John,
My initial plan for TIKA-1302 is very similar to what Tilman outlined, and
my understanding/concerns/thoughts were very much in line with what he
articulated. The idea is that there should be a small Apache license-able gold
truth set like both projects now have for specific unit
Ditto what Nick said on internal metadata. Are you referring to external
metadata that we could get in Java 7 via BasicFileAttributes on OS's that
support those?
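For anyone who hasn't used it, here is a minimal sketch of what reading that external metadata looks like with the Java 7 NIO API. Files.readAttributes and BasicFileAttributes are standard JDK; the class name and the map keys below are made up for illustration and are not Tika Metadata property names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not part of Tika.
class ExternalFileMetadataSketch {

    // Collects file-system level attributes; the keys are illustrative only.
    static Map<String, String> externalMetadata(Path file) {
        try {
            BasicFileAttributes attrs = Files.readAttributes(file, BasicFileAttributes.class);
            Map<String, String> m = new LinkedHashMap<>();
            m.put("size", Long.toString(attrs.size()));
            m.put("created", attrs.creationTime().toString());
            m.put("lastModified", attrs.lastModifiedTime().toString());
            m.put("lastAccessed", attrs.lastAccessTime().toString());
            return m;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Creates a five-byte temp file and returns its external metadata.
    static Map<String, String> demo() {
        try {
            Path tmp = Files.createTempFile("attrs", ".txt");
            try {
                Files.write(tmp, "hello".getBytes(StandardCharsets.US_ASCII));
                return externalMetadata(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Which timestamps are meaningful depends on the file system, so the creation and access times should be treated as best effort.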
-Original Message-
From: Nick Burch [mailto:apa...@gagravarr.org]
Sent: Thursday, July 10, 2014 6:52 AM
To:
+1
Linux version 2.6.32-431.5.1.el6.x86_64: Java 1.6 and 1.7
Windows 7, Java 1.7
I also ran Tika 1.5 and 1.6 rc1 against a random selection of 10,000 docs (all
formats) plus all available msoffice-x files in govdocs1, yielding 10,413 docs.
There were several improvements in text extraction
:
+1
OSX 10.9.3, Java 1.7
Tyler
On Mon, Jul 28, 2014 at 7:09 AM, Allison, Timothy B.
talli...@mitre.org
wrote:
+1
Linux version 2.6.32-431.5.1.el6.x86_64: Java 1.6 and 1.7
Windows 7, Java 1.7
I also ran Tika 1.5 and 1.6 rc1 against a random selection of 10,000 docs
(all formats
[mailto:apa...@gagravarr.org]
Sent: Thursday, July 31, 2014 3:06 PM
To: dev@tika.apache.org
Subject: RE: [VOTE] Apache Tika 1.6 release candidate #1
On Thu, 31 Jul 2014, Allison, Timothy B. wrote:
On a related note, I did some digging on the one regression I found in
the pptx
Rat checked out, successful build on linux.
+1... with one reservation
I just ran a fresh update of trunk from Tika with RC for POI 3.11 Beta 1
against a random selection of ~10k files from govdocs1, covering many formats.
There aren't many office-x files, but there are some, and I made sure
Great to hear! Maybe we just need to update something on the Tika side to
grab the cell comments:
http://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java
?
The token check is my primordial TIKA-1302 code,
Recursion is one that causes confusion; we've got some example programs on
the wiki that we can include:
https://wiki.apache.org/tika/RecursiveMetadata
Ray Gauss is probably our best bet for advanced metadata stuff to send in
some examples on that!
For development on TIKA-1302, I've been using
Probably better question for the user list.
Extending a ContentHandler and using that in ContentHandlerDecorator is pretty
straightforward.
Would it be easy enough to write to file by passing in an OutputStream to
WriteOutContentHandler?
-Original Message-
From: ruby
My belief in making that recommendation was that a given document wouldn't
split a word across an element. I can, of course, think of exceptions (word
break at the end of a PDF page, for example), but generally, my assumption is
that this wouldn't happen very often. However, if this does
TimothyAllison
I’d like to start documenting tika-batch.
Thank you!
Best,
Tim
+1
Built in both salt water and fresh (er, Windows 7 and RHEL 6.5).
Thank you, Chris!
And thank you, Uwe and Nick, for the quick work to get poi-3.11-beta2 included!
-Original Message-
From: Tyler Palsulich [mailto:tpalsul...@gmail.com]
Sent: Tuesday, September 02, 2014 7:24 AM
To:
Probably want to add TIKA-1411.
Nick and all, anything else?
-Original Message-
From: Hong-Thai Nguyen [mailto:thaicha...@gmail.com]
Sent: Thursday, September 11, 2014 10:10 AM
To: dev@tika.apache.org
Subject: Re: NPE on all *.odt, odp, .ods documents
Hi Chris,
Sounds perfect to me.
Science Department
University of Southern California, Los Angeles, CA 90089 USA
++
-Original Message-
From: Allison, Timothy B. talli...@mitre.org
Reply-To: u...@tika.apache.org u...@tika.apache.org
Date: Thursday, September
All,
I have Intellij set to order imports by javax, java, then other. I think
this is the most common pattern in Tika. Is it ok if I make these
(meaningless/formatting) changes when I commit other changes?
Thank you.
Best,
Tim
, Timothy B. talli...@mitre.org
Reply-To: dev@tika.apache.org dev@tika.apache.org
Date: Tuesday, October 21, 2014 at 1:59 PM
To: dev@tika.apache.org dev@tika.apache.org
Subject: import (re)ordering?
All,
I have Intellij set to order imports by javax, java, then other. I
think this is the most
Sorry for coming late to the game on the implications of TIKA-1445. I don't
want to hold up the release of 1.7.
However, would it be possible to return to the legacy default behavior of
extracting metadata from images?
We can then document on the OCR parser page on the wiki that you need
Hi Nick,
The build is working for me on linux and Windows with Java 1.7. Can you tell
which file is causing the problem? I wonder if the upgrade to PDFBox 1.8.7
caused the issue?
-Original Message-
From: Nick Burch [mailto:n...@apache.org]
Sent: Wednesday, October 29, 2014 4:40 PM
, 2014 9:00 AM
To: dev@tika.apache.org
Subject: RE: PDF test failing on trunk
On Thu, 30 Oct 2014, Allison, Timothy B. wrote:
The build is working for me on linux and Windows with Java 1.7. Can
you tell which file is causing the problem? I wonder if the upgrade to
PDFBox 1.8.7 caused
I think so. Would you like the honors?
-Original Message-
From: Nick Burch [mailto:apa...@gagravarr.org]
Sent: Thursday, October 30, 2014 9:23 AM
To: dev@tika.apache.org
Subject: RE: PDF test failing on trunk
On Thu, 30 Oct 2014, Allison, Timothy B. wrote:
Ha. Works with an older
and give it a test on my version
of 1.6.
-Original Message-
From: Nick Burch [mailto:apa...@gagravarr.org]
Sent: Thursday, October 30, 2014 12:01 PM
To: dev@tika.apache.org
Subject: RE: PDF test failing on trunk
On Thu, 30 Oct 2014, Allison, Timothy B. wrote:
I think so. Would you
Chris,
Thank you for moving this to the dev list. This would be a fairly large
change, and the discussion is valuable.
-Original Message-
From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov]
Sent: Monday, November 17, 2014 5:25 PM
To: dev@tika.apache.org
Subject:
All,
With many thanks to Sergey, I added JSON and XMP to “/meta” and I folded in
MetadataEP into MetadataResource so that users can request a specific metadata
value(s). (TIKA-1497, TIKA-1499)
I also added a new endpoint “/rmeta” that is equivalent to tika-app’s –J
(TIKA-1498) – JSONified
Uwe,
To confirm, we need to add this <pluginManagement>...</pluginManagement>
fully as it is in the parent pom.xml; we should not put the plugin under our
regular plugins (which no longer have pluginManagement)?
-Original Message-
From: Uwe Schindler [mailto:u...@thetaphi.de]
Sent:
Will do.
-Original Message-
From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov]
Sent: Friday, January 23, 2015 2:11 PM
To: dev@tika.apache.org
Subject: Re: Forbidden-APIS no longer ran because of carzy POM change
awesome. Thanks Uwe.
Tim you want to put that in, or
+1
Built successfully on both Windows 7 and RHEL 6.5 for me...no Tesseract
installed. Relying on post rc2 release eval for TIKA-1445 against trunk for no
new regressions. Manually confirmed image metadata is being extracted.
Thank you, Tyler!
Best,
Tim
/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++
-Original Message-
From: Allison, Timothy B
, though -- I'll wait for Tim's patch then send an RC#2.
Sound good?
Tyler
On Mon, Jan 5, 2015 at 8:09 AM, Allison, Timothy B. talli...@mitre.org
wrote:
All,
I think I may have found a problem with the interaction of
OutlookPSTParser with AutoDetectParser that I'd want to fix before 1.7
All,
Thanks to Nick for adding mime info for db files, we can now identify several
common db files.
What is the community's level of interest in adding parsers for databases that
store data in one file, such as .mdb, .dbf, .sqlite, .hsqldb ... (others?)?
Most of the jdbc drivers are not
Chris,
Is this on an updated and/or reverted trunk or on a modified rc-3?
I haven't gotten around to installing tesseract yet so I can't actually kick
the tires, but the last time there was a test for 5 items on line 91 of
RFC822ParserTest was in r1552405...before the fixes for TIKA-1422.
Chris,
Should we interpret this as -1 on rc3 from you? Or should we go forth with
testing and voting on rc3?
Thank you!
Best,
Tim
-Original Message-
From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov]
Sent: Monday, January
All,
I just started a wiki page for our discussion of Tika 2.0
(https://wiki.apache.org/tika/Tika2_0RoadMap). Please modify/edit/discuss as
you see fit.
On a related note, I also started a wiki page for our CompositeParser
strategy discussion
, February 13, 2015 7:51 AM
To: dev@tika.apache.org
Subject: Re: Parser that includes LGPL as provided?
On Fri, 13 Feb 2015, Allison, Timothy B. wrote:
After I dig myself out of several other issues that I'd like to tackle,
I'd like to add a parser for MSAccess files. There's a pure java LGPL
I'm working behind a proxy and getting a new proxy error (proxy
unacknowledged) with r1658847 on tika-server package.
-Original Message-
From: Nick Burch [mailto:apa...@gagravarr.org]
Sent: Wednesday, February 11, 2015 6:18 AM
To: dev@tika.apache.org
Subject: Re: svn commit: r1658847 -
All,
I just noticed that tika-app has gone from ~30MB to ~44MB, ~20k file to ~27k
files. 3.5 of those new MB are for README.NLDAS1.pdf and README.NLDAS2.pdf.
Can we exclude those in the app and server? Are there other items that we
should exclude?
Cheers,
All,
After I dig myself out of several other issues that I'd like to tackle, I'd
like to add a parser for MSAccess files. There's a pure java LGPL library,
Jackcess, available on maven, and it appears to be quite active.
I know we have a list of third party parsers, but I'm wondering if we
/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++
-Original Message-
From: Allison, Timothy B. talli...@mitre.org
Reply-To: dev
In the back of my memory, there's a ticket open for fixing the logged messages
from PDFBox (or maybe just fixing the pdfs that triggered the messages), but I
can't find it quickly. It may have been a smaller part of something that we've
already closed out, or it might still be open.
Tyler,
Hi Tyler,
This has started to irk me as well, a bit. I don't think there's much overlap,
although there is some. I think navigating standard package resource paths
might be cumbersome even with a good IDE... perhaps start with high-level
subdirectories as chm is now doing?
-Original
Once we fix TIKA-1584, I don't have a preference. I defer to Chris's
experience (so I guess, +1 for 1.8) given the amount of work required.
It'd be great if we could make sure we aren't bundling any pdfs in our tika-app
jar, too. Many apologies if that's been fixed!
-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Monday, March 30, 2015 7:03 AM
To: dev@tika.apache.org
Subject: RE: [DISCUSS] Tika 1.8 or 1.7.1
Unless there are objections, I'd like these to be resolved before 1.8:
TIKA-1584 -- I'll fix
TIKA-1575 -- Resolved by Konstantin Gribov (thank
the hyperlink into a new doc and change the URL? I have no
idea about including the modified version.
Tyler
On Mar 30, 2015 9:18 AM, Allison, Timothy B. talli...@mitre.org wrote:
All,
As part of TIKA-1512, I found that I can delete all of the contents,
including the metadata, except for one hyperlink
All,
I've made the changes that I had hoped to. Grib pdf exclusion remains for any
takers.
Let me know when I should initiate the run against govdocs1 to see if there are
any surprises on that corpus with Tika 1.8.
Best,
Tim
-Original Message-
From: Allison, Timothy B
Backwards compatibility issue found by clirr on TIKA-1587
[INFO] --- clirr-maven-plugin:2.3:check (default) @ tika-core ---
[ERROR] org.apache.tika.fork.ForkParser: Return type of method 'public
java.lang.String getJavaCommand()' has been changed to java.util.List
[ERROR]
Unless there are objections, I'd like these to be resolved before 1.8:
TIKA-1584 -- I'll fix
TIKA-1575 -- Resolved by Konstantin Gribov (thank you!)
TIKA-1512 -- I'll put in a temporary fix so that we don't get IOOBEs, but I'll
leave this open and do some more digging to see if we need to open a
How much of an effort would it be to migrate somewhat slowly:
Leave in but deprecate setCommandLine(String ) and String getCommandLine()
Add something like: setCommandLineArr(String[] ) and String[]
getCommandLineArr()?
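To make the slow-migration idea concrete, a rough sketch: the String accessors stay in place but are marked deprecated and delegate to the new array-based ones, so existing callers keep compiling. The class name is made up, the method names simply follow the ones quoted above, and the whitespace split is an assumption for illustration (it would not handle quoted arguments or paths containing spaces).

```java
// Hypothetical sketch of a gradual API migration; not the actual Tika class.
class CommandLineHolder {

    private String[] commandLine = new String[0];

    /**
     * @deprecated use {@link #setCommandLineArr(String[])} instead
     */
    @Deprecated
    public void setCommandLine(String commandLine) {
        // Assumption: naive whitespace split, kept only for old callers.
        setCommandLineArr(commandLine.trim().split("\\s+"));
    }

    /**
     * @deprecated use {@link #getCommandLineArr()} instead
     */
    @Deprecated
    public String getCommandLine() {
        return String.join(" ", commandLine);
    }

    // New array-based accessors; defensive copies keep the state private.
    public void setCommandLineArr(String[] commandLine) {
        this.commandLine = commandLine.clone();
    }

    public String[] getCommandLineArr() {
        return commandLine.clone();
    }

    public static void main(String[] args) {
        CommandLineHolder holder = new CommandLineHolder();
        holder.setCommandLine("java -Xmx32m");
        System.out.println(holder.getCommandLineArr().length);
    }
}
```

Because the old signatures are untouched, a binary-compatibility check on the existing methods should have nothing to complain about, and the deprecated pair can be dropped in a later major release.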
-Original Message-
From: Konstantin Gribov
) to avoid build failure. And use new ones internally.
I'll do `mvn verify` before committing this time. Sorry for the inconvenience.
--
Best regards,
Konstantin Gribov
пн, 30 марта 2015 г. в 18:09, Allison, Timothy B. talli...@mitre.org:
How much of an effort would it be to migrate somewhat slowly
I wonder if it is time to do a re-copy. :)
-Original Message-
From: Tyler Palsulich [mailto:tpalsul...@gmail.com]
Sent: Friday, March 20, 2015 5:17 PM
To: dev@tika.apache.org
Subject: Re: Licensing Question
Perfect. I should have thought of the commit message. Thank you, Ken!
Tyler
On
Might be thinking of TIKA-944?
Mind if we switch the CORS short option to -C and use -c for the tika config
file?
-Original Message-
From: Tyler Palsulich [mailto:tpalsul...@gmail.com]
Sent: Wednesday, April 01, 2015 11:13 AM
To: dev@tika.apache.org
Subject: Re: Access Control Allow
, Timothy B.
Cc: dev@tika.apache.org
Subject: Rackspace VM and Standing up Tika Server
Hi Tim,
Can you please fill us in on the current status of the Tika + Rackspace
effort?
I have neglected this, so apologies.
I want to document what is available on the Tika wiki so we do not lose it
again.
I
+1 to dropping 1.6...let's move to 1.8 and beyond! :)
-Original Message-
From: Lewis John Mcgibbney [mailto:lewis.mcgibb...@gmail.com]
Sent: Thursday, January 29, 2015 6:51 PM
To: dev@tika.apache.org
Subject: TIKA-1423 Build a parser to extract data from GRIB formats not good
with Java
at 2:55 PM, Allison, Timothy B. talli...@mitre.org
wrote:
This looks like a Hudson hiccup.
Tyler is seeing excessive logging:
Running org.apache.tika.cli.TikaCLIBatchIntegrationTest
INFO - about to start driver
INFO - about to start driver
Anyone else having problems building from a fresh
Sorry, link wasn’t included:
https://groups.google.com/forum/#!topic/common-crawl/Cv21VRQjGN0
From: tallison314...@gmail.com [mailto:tallison314...@gmail.com]
Sent: Friday, April 03, 2015 8:35 AM
To: d...@pdfbox.apache.org; dev@tika.apache.org; d...@poi.apache.org
Subject: Fwd: Any interest in
Thank you, Tyler!
-Original Message-
From: Tyler Palsulich [mailto:tpalsul...@apache.org]
Sent: Monday, April 20, 2015 5:09 PM
To: dev@tika.apache.org; u...@tika.apache.org; annou...@apache.org
Subject: [ANNOUNCE] Apache Tika 1.8 Released
The Apache Tika project is pleased to announce
Oops, our emails passed in the ether. Thank you, Jukka!
-Original Message-
From: Jukka Zitting [mailto:jukka.zitt...@gmail.com]
Sent: Wednesday, April 22, 2015 12:06 PM
To: dev@tika.apache.org
Subject: Re: comparing Tika's file detect with other tools?
Hi,
Copyright also covers
: Allison, Timothy B.
Sent: April 22, 2015 5:47:17am PDT
To: dev@tika.apache.org
Subject: comparing Tika's file detect with other tools?
Would it be frowned upon to compare Tika's file detection with other
tools, like file? Any concerns about effectively reverse engineering
(when we find
Would it be frowned upon to compare Tika's file detection with other tools,
like file? Any concerns about effectively reverse engineering (when we find
that Tika is wrong) from a non-Apache project?
Any other sensitivities I should be aware of?
Best,
Tim
wrote:
From: Allison, Timothy B.
Sent: April 20, 2015 5:11:04am PDT
To: dev@tika.apache.org
Subject: RE: [VOTE] Apache Tika 1.8 Release Candidate #2
If I understand correctly, if we release rc2, Tika 1.8 will break in
Hadoop clusters across the land?!
Or, Hadoop folks will have
Hi All,
I can't remember where we are on this. Are we dropping support for Java 1.6
in Tika 1.9? If so, should we open an issue to integrate tika-java7 into core,
add diamond operators, catching multiple exceptions... anything else...?
Or, do we want to wait for Tika 2.0 or Tika 1.10?
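For reference, a small sketch of the two Java 7 language features mentioned above, the diamond operator and multi-catch. The class and method names are made up for illustration and nothing here is Tika code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of Java 7 syntax; not part of Tika.
class Java7SyntaxSketch {

    // Multi-catch: one handler for both failure modes.
    static int parseOrDefault(String s, int fallback) {
        try {
            return Integer.parseInt(s.trim());
        } catch (NumberFormatException | NullPointerException e) {
            // NumberFormatException for non-numeric input,
            // NullPointerException from trim() when s is null.
            return fallback;
        }
    }

    static List<Integer> parseAll(String... values) {
        // Diamond operator: the type argument is inferred from the declaration.
        List<Integer> parsed = new ArrayList<>();
        for (String value : values) {
            parsed.add(parseOrDefault(value, 0));
        }
        return parsed;
    }

    public static void main(String[] args) {
        System.out.println(parseAll("1", null, "x", " 7 "));
    }
}
```

On Java 1.6 the same code would need the full `new ArrayList<Integer>()` and two separate catch blocks.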
: [VOTE] Apache Tika 1.8 Release Candidate #2
Hi Tim
Great to hear that you managed to use the dataset from CommonCrawl. Thanks!
Julien
On 14 April 2015 at 14:15, Allison, Timothy B. talli...@mitre.org wrote:
+1
Thank you, Tyler!
Apologies to Hong-Thai and community for not recognizing
to make sure the above issues are
(believed to be) settled before the next cut.
Thanks,
Tyler
On Apr 10, 2015 4:55 PM, David Meikle loo...@gmail.com wrote:
On 10 Apr 2015, at 11:38, Allison, Timothy B. talli...@mitre.org
wrote:
I agree that the ODT issue might require a respin. What do
+1
Thank you, Tyler!
Apologies to Hong-Thai and community for not recognizing the severity of
TIKA-1600 when I voted in favor of rc1!
Details...
I reran against govdocs1, and there aren't any major surprises.
On our Rackspace vm, I _finally_ unzipped the Common Crawl slice that Julien
here:
https://groups.google.com/forum/#!topic/common-crawl/Cv21VRQjGN0
I’ve tried to follow Commons’ vernacular, and I’ve added [COMPRESS] to the
Subject line. Please invite others who might have an interest in this work.
Best,
Tim
From: Allison, Timothy B.
Sent
This looks like a Hudson hiccup.
Tyler is seeing excessive logging:
Running org.apache.tika.cli.TikaCLIBatchIntegrationTest
INFO - about to start driver
INFO - about to start driver
Anyone else having problems building from a fresh trunk?
-Original Message-
From: Hudson (JIRA)
I just finished the run against govdocs1 with 1.7 vs. 1.8-rc1, and all looks good
with one major change... on first glance.
Because of my fix on TIKA-1519 and the law of unintended consequences, files
that start like so:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
For those who want to take a look at the reports (much more work is needed on
processing stack traces for SORT_STACK_TRACE):
https://github.com/tballison/share/blob/master/tika_comparisons/tika_1_7_v_1_8-rc1.zip
All,
What do you think?
https://groups.google.com/forum/#!topic/common-crawl/Cv21VRQjGN0
On Friday, April 3, 2015 at 8:23:11 AM UTC-4,
talliso...@gmail.com wrote:
CommonCrawl currently has the WET format that extracts plain text from web
pages. My guess is that
-excel
6116
847/847762.ppt
847762.ppt/992
application/vnd.ms-excel
6119
Looks like the majority are embedded in ppt, but there are several embedded in
xls as well.
Cheers,
Tim
-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Wednesday
Fixed eval code, thanks to Nick.
Now running against doc/x list fixes to confirm success.
Will rerun tomorrow on full set, with results by noon ETD.
-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Wednesday, June 03, 2015 7:28 AM
To: dev@tika.apache.org
Thank you, Nick!
-Original Message-
From: Nick Burch [mailto:apa...@gagravarr.org]
Sent: Friday, June 05, 2015 6:15 AM
To: dev@tika.apache.org
Subject: RE: [DISCUSS] 1.9 Tika release?
text/dif+xml-application/dif+xml
Expected and fine
Agreed on the mime type, but is there a reason
+1
Built in Windows and Linux. Works on problems (that I caused!) in rc1.
Let's make sure to include last Java 1.6 version in the release notes, if
that's what we've decided.
Thank you, Chris!
Best,
Tim
-Original Message-
From: Mattmann, Chris A (3980)