[jira] [Commented] (TIKA-1804) Tika use no free json.org

2016-11-18 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15678151#comment-15678151 ] Nick Burch commented on TIKA-1804: -- Ted Dunning has produced a hopefully drop-in replacement (based

[jira] [Reopened] (TIKA-1804) Tika use no free json.org

2016-11-16 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch reopened TIKA-1804: -- The ASF legal team have recently changed their mind on the license (see https://lists.apache.org

[jira] [Commented] (TIKA-2159) Handle pre-parse embedded object exceptions uniformly and more robustly

2016-11-09 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15651535#comment-15651535 ] Nick Burch commented on TIKA-2159: -- Given that we don't control all the parsers, I'm worried things my

[jira] [Commented] (TIKA-2146) Unable to extract contents from protected MS word-doc-java.lang.ArrayIndexOutOfBoundsException

2016-11-04 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15636102#comment-15636102 ] Nick Burch commented on TIKA-2146: -- My guess is it's about 2-3 weeks of work at the POI level to add

[jira] [Commented] (TIKA-2146) Unable to extract contents from protected MS word-doc-java.lang.ArrayIndexOutOfBoundsException

2016-10-27 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15614327#comment-15614327 ] Nick Burch commented on TIKA-2146: -- As per https://poi.apache.org/encryption.html, there's no support

[jira] [Commented] (TIKA-2144) NullPointerException on a valid Word file

2016-10-27 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15610867#comment-15610867 ] Nick Burch commented on TIKA-2144: -- Do you know how the file in question was generated? It seems to have

[jira] [Commented] (TIKA-2122) Extract all email headers from Outlook .msg files into Metadata

2016-10-16 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579692#comment-15579692 ] Nick Burch commented on TIKA-2122: -- I'm not sure if we want to be dumping these raw into the Tika metadata

Re: Tika parsers 1.14-SNAPSHOT parses empty content depending to Apache POI 3.15

2016-10-12 Thread Nick Burch
On Wed, 12 Oct 2016, Simone Tripodi wrote: while upgrading the system where I've been working on, I updated Apache POI to version 3.15, then Tika (currently tika-parsers-1.7, I am testing tika-parsers-1.14-SNAPSHOT) You can't just upgrade one jar. You need to use all of the POI jars together

Re: tika-2.x - Build # 156 - Failure

2016-10-05 Thread Nick Burch
On Wed, 5 Oct 2016, Apache Jenkins Server wrote: The Apache Jenkins build system has built tika-2.x (build #156) Check console output at https://builds.apache.org/job/tika-2.x/156/ to view the results. Another one for our Jenkins experts. Looks like it needs a bit more memory for the job,

Re: tika-2.x-windows - Build # 60 - Still Failing

2016-10-05 Thread Nick Burch
On Wed, 5 Oct 2016, Apache Jenkins Server wrote: The Apache Jenkins build system has built tika-2.x-windows (build #60) Check console output at https://builds.apache.org/job/tika-2.x-windows/60/ to view the results. Anyone with Jenkins-foo able to fix our Windows Jenkins builds? This failed

[jira] [Commented] (TIKA-2107) Old MS Word files give error while indexing

2016-10-05 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15548427#comment-15548427 ] Nick Burch commented on TIKA-2107: -- The attached file is an old Word 2 file, not supported by POI

[jira] [Commented] (TIKA-2107) Old MS Word files give error while indexing

2016-10-03 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15542115#comment-15542115 ] Nick Burch commented on TIKA-2107: -- What error are you getting? How are you calling Tika? Are you really

[jira] [Commented] (TIKA-2099) Tar files without magic bytes are sporadically detected as text

2016-09-29 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534266#comment-15534266 ] Nick Burch commented on TIKA-2099: -- This patch removes some special handling put in place for COMPRESS-117

Re: Plans for the first Tika 2.0 release

2016-09-21 Thread Nick Burch
On Mon, 19 Sep 2016, Bob Paulin wrote: I think it's a good thing to discuss. I know there are other features that are targeted for 2.0. Do we have a general sense of where those features are at? I think the big one we need to crack is allowing multiple parsers to run against a file. OCR is

[jira] [Commented] (TIKA-2087) Extracting text from xml file failing.

2016-09-21 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15509659#comment-15509659 ] Nick Burch commented on TIKA-2087: -- This is an invalid XML file. You need to fix it so

[jira] [Commented] (TIKA-2086) Metadata also getting extracted along with document text

2016-09-21 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15509655#comment-15509655 ] Nick Burch commented on TIKA-2086: -- How are you calling Apache Tika? Is this happening for all files

[jira] [Commented] (TIKA-1997) Problem in Tika().detect for xml file signed in CADES

2016-09-19 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15504023#comment-15504023 ] Nick Burch commented on TIKA-1997: -- Running your file through the openssl tool {{ asn1parse }}, it shows

[jira] [Commented] (TIKA-2069) Extract Macro text from Microsoft Office documents

2016-09-15 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15493924#comment-15493924 ] Nick Burch commented on TIKA-2069: -- Yes! If you wrote a VB Script, and zipped it up, it'd be a {{text/x

[jira] [Commented] (TIKA-2058) Memory Leak in Tika version 1.13 when parsing millions of files

2016-09-14 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15490844#comment-15490844 ] Nick Burch commented on TIKA-2058: -- The code posted above isn't calling {{close}} on the {{MAPIMessage

[jira] [Commented] (TIKA-2058) Memory Leak in Tika version 1.13 when parsing millions of files

2016-09-14 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15490717#comment-15490717 ] Nick Burch commented on TIKA-2058: -- Looking at the patch, I'm not sure how it will help? When

RE: Query on correct use of 'fileUrl' in TikaJAXRS Server to extract document at remote url - my request is not working

2016-09-14 Thread Nick Burch
On Wed, 14 Sep 2016, Allison, Timothy B. wrote: Would it be as much of a disaster to require the user to allow the fileUrl capability on the commandline at server startup? We could add some menacing "all bets are off, we hope you know what you're doing" warning. With a special switch, and a

Re: A new Tika App in 2.0?

2016-09-13 Thread Nick Burch
On Sun, 11 Sep 2016, Bob Paulin wrote: I'd like to propose a new Tika App for the 2.0 branch. One of the reasons we broke apart the Tika parsers into modules was due to the complexity of having to deal with all the parser dependencies and transitive dependencies. Now developers can use just

[jira] [Commented] (TIKA-2064) Document type detected incorrectly for Stata datasets (.dta extension)

2016-09-13 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488175#comment-15488175 ] Nick Burch commented on TIKA-2064: -- Are you happy to dual-license it as Apache License, Version 2.0? We

[jira] [Commented] (TIKA-2069) Extract Macro text from Microsoft Office documents

2016-09-13 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487662#comment-15487662 ] Nick Burch commented on TIKA-2069: -- I think the idea of a Macro is probably general enough across a range

[jira] [Commented] (TIKA-2064) Document type detected incorrectly for Stata datasets (.dta extension)

2016-09-13 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487585#comment-15487585 ] Nick Burch commented on TIKA-2064: -- Magic added in 3c0abc8eb. No unit tests yet though, we can add them

RE: Query on correct use of 'fileUrl' in TikaJAXRS Server to extract document at remote url - my request is not working

2016-09-13 Thread Nick Burch
On Tue, 13 Sep 2016, John Dougrez-Lewis wrote: Surely the security vulnerability could have been fixed by disallowing "file://" variants in the URL rather than removing the feature altogether? Or were there other implementation issues relating to the fileUrl feature that meant it was best

[jira] [Commented] (TIKA-2069) Extract Macro text from Microsoft Office documents

2016-09-12 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484834#comment-15484834 ] Nick Burch commented on TIKA-2069: -- I think that, given both how big macros can get and how they logically

[jira] [Commented] (TIKA-2064) Document type detected incorrectly for Stata datasets (.dta extension)

2016-09-12 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484812#comment-15484812 ] Nick Burch commented on TIKA-2064: -- If you could, that would be most helpful! > Document type detec

[jira] [Commented] (TIKA-2064) Document type detected incorrectly for Stata datasets (.dta extension)

2016-08-29 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15446089#comment-15446089 ] Nick Burch commented on TIKA-2064: -- From a quick google, `application/x-stata-dta` seems to be what ot

Re: Tika 1.14?

2016-08-12 Thread Nick Burch
On Thu, 11 Aug 2016, Bob Paulin wrote: I know it's been a little bit since we talked about 2.0. We had discussed holding off while some API changes that were under consideration. Has any progress been made on this? I think we're still trying to come up with a plan for how to allow multiple

[jira] [Commented] (TIKA-1367) Tika documentation should list tika-parsers parser dependencies

2016-08-05 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15409311#comment-15409311 ] Nick Burch commented on TIKA-1367: -- The code exists and you can check out the more modular parsers already

[jira] [Commented] (TIKA-1367) Tika documentation should list tika-parsers parser dependencies

2016-08-05 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15409272#comment-15409272 ] Nick Burch commented on TIKA-1367: -- This should be largely fixed on the 2.x branch, which has more modular

[jira] [Commented] (TIKA-2046) Can not read PDF correctly

2016-08-01 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401757#comment-15401757 ] Nick Burch commented on TIKA-2046: -- As per the troubleshooting guide, if one of your files doesn't work

[jira] [Commented] (TIKA-2046) Can not read PDF correctly

2016-07-31 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401089#comment-15401089 ] Nick Burch commented on TIKA-2046: -- Can you try following the steps in https://wiki.apache.org/tika

[jira] [Commented] (TIKA-2045) TIKA crashes / runs out of memory on simple PDF

2016-07-28 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15397499#comment-15397499 ] Nick Burch commented on TIKA-2045: -- Sounds like it's checking permissions then skipping extraction, so

[jira] [Commented] (TIKA-2045) TIKA crashes / runs out of memory on simple PDF

2016-07-28 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15397442#comment-15397442 ] Nick Burch commented on TIKA-2045: -- As per https://wiki.apache.org/tika/Troubleshooting%20Tika

[jira] [Commented] (TIKA-2044) MboxParser wrongly concatenates multiple text lines into single header line

2016-07-27 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15396477#comment-15396477 ] Nick Burch commented on TIKA-2044: -- Are you able to reproduce this in a simple junit unit test case

[jira] [Commented] (TIKA-2041) Charset detection doesn't appear to be thread-safe

2016-07-27 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15395282#comment-15395282 ] Nick Burch commented on TIKA-2041: -- Running "git log" and "git diff" on the file s

[jira] [Commented] (TIKA-2041) Charset detection doesn't appear to be thread-safe

2016-07-26 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394458#comment-15394458 ] Nick Burch commented on TIKA-2041: -- We added the {{EBCDIC_500_}} family of detectors into our own copy

[jira] [Commented] (TIKA-2042) MBOX file detected wrongly as text/html

2016-07-26 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15393617#comment-15393617 ] Nick Burch commented on TIKA-2042: -- Fixed in {{72d2d88b381ba75942ae791042ef54af33ee1f38}} - your test file

[jira] [Resolved] (TIKA-2042) MBOX file detected wrongly as text/html

2016-07-26 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-2042. -- Resolution: Fixed Fix Version/s: 1.14 > MBOX file detected wrongly as text/h

[jira] [Resolved] (TIKA-2037) Problems with email attachments

2016-07-20 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-2037. -- Resolution: Fixed Fix Version/s: 1.14 Fixed in 952fb54 along with a simpler unit test inspired

[jira] [Commented] (TIKA-2037) Problems with email attachments

2016-07-20 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386159#comment-15386159 ] Nick Burch commented on TIKA-2037: -- I've just tried with a 1.14 snapshot build, and both are detected

[jira] [Commented] (TIKA-2032) OptimaizeLangDetector can not be resolved

2016-07-20 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386093#comment-15386093 ] Nick Burch commented on TIKA-2032: -- It is contained in the {{tika-langdetect}} module, which is optional

[jira] [Commented] (TIKA-2025) Extraction of long sequences of digits from Excel spreadsheets using Tika 1.13 doesn’t yield the expected results

2016-06-30 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15357125#comment-15357125 ] Nick Burch commented on TIKA-2025: -- We could always test the formatted value for {{E+}} (or {{E

[jira] [Commented] (TIKA-2017) Tika Server Cannot handle large files

2016-06-23 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15347398#comment-15347398 ] Nick Burch commented on TIKA-2017: -- The server ought to be pushing the XML out to the client

[jira] [Commented] (TIKA-1358) Add support for newer iWork file formats

2016-06-22 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15344390#comment-15344390 ] Nick Burch commented on TIKA-1358: -- The OOXML stuff uses a {{.version suffix}}, so if we followed

[jira] [Commented] (TIKA-1358) Add support for newer iWork file formats

2016-06-22 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15343896#comment-15343896 ] Nick Burch commented on TIKA-1358: -- Commons Compress 1.12 is out, with our required snappy support

[jira] [Commented] (TIKA-2015) MAPIMessage String fileName constructor leaves file open

2016-06-19 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15338814#comment-15338814 ] Nick Burch commented on TIKA-2015: -- Fixed on the POI side in r1749213, will be included in POI 3.15 beta 2

[jira] [Commented] (TIKA-2004) Add mime detection for Windows Media Metafile, PRONOM: application/x-puid-fmt-584

2016-06-15 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15331585#comment-15331585 ] Nick Burch commented on TIKA-2004: -- Wikipedia claims - https://en.wikipedia.org/wiki

[jira] [Commented] (TIKA-2003) Tika 1.13 gpg signature not validating.

2016-06-13 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15327727#comment-15327727 ] Nick Burch commented on TIKA-2003: -- Looks like David hasn't added his GPG key fingerprint to his profile

[jira] [Commented] (TIKA-2001) Parsing XML outputs empty string

2016-06-09 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15322404#comment-15322404 ] Nick Burch commented on TIKA-2001: -- What's the output of `--detect` on the problematic file? > Pars

[jira] [Resolved] (TIKA-1989) Weird sentence in website

2016-05-28 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-1989. -- Resolution: Fixed Fix Version/s: 1.14 Markup fixed in r1745867, and deployed to the site

[jira] [Commented] (TIKA-1989) Weird sentence in website

2016-05-28 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15305313#comment-15305313 ] Nick Burch commented on TIKA-1989: -- There's more text in the src apt file - I'll have to work out why

Re: TIKA not returning exact MIME type

2016-05-27 Thread Nick Burch
On Fri, 27 May 2016, Rahul Khandelwal wrote: I am using detect/stream API to retrieve the MIME type of the file. But it's not returning exact MIME type for some document if I am passing 1KB of data of that file. That's expected. For example - For open office document it's returning

[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2016-05-25 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15300737#comment-15300737 ] Nick Burch commented on TIKA-1513: -- I haven't read much on the format, but I'd be tempted to maybe have

[jira] [Comment Edited] (TIKA-1513) Add mime detection and parsing for dbf files

2016-05-25 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15300737#comment-15300737 ] Nick Burch edited comment on TIKA-1513 at 5/25/16 7:52 PM: --- I haven't read much

[jira] [Commented] (TIKA-1979) Issue message when server mode has started

2016-05-23 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15296225#comment-15296225 ] Nick Burch commented on TIKA-1979: -- For production use, I'd suggest you switch to the Tika Server

Re: HtmlParser regression (TIKA-1938)

2016-05-20 Thread Nick Burch
On Fri, 20 May 2016, Joseph Naegele wrote: I introduced a regression in the HtmlParser in TIKA-1938, which added the ability to emit parsed tags found in the HTML .