[jira] [Commented] (TIKA-2478) MBOX import includes redundant copies of the text

2017-10-17 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16207598#comment-16207598 ] Nick Burch commented on TIKA-2478: -- Following the outlook parser model seems likely to deliver "

[jira] [Commented] (TIKA-2473) PCX and DCX image support

2017-10-06 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16195579#comment-16195579 ] Nick Burch commented on TIKA-2473: -- I've added some test files, mime magic and detection. The magic

Re: [DISCUSS] Enable specific ContentHandler for tika-server

2017-09-29 Thread Nick Burch
On Fri, 29 Sep 2017, Giuseppe Totaro wrote: To sum up, I would like to quickly discuss the following aspects: - As you all mentioned, the HTTP headers for configuring the ContentHandler to be used are better suited for the dynamic cases. Specifically, a ContentHandler can be given through

Re: [DISCUSS] Enable specific ContentHandler for tika-server

2017-09-28 Thread Nick Burch
On Thu, 28 Sep 2017, Giuseppe Totaro wrote: if I am not wrong, currently you cannot configure a specific ContentHandler while using tika-server. I mean that you can configure your own parser [0] but you cannot control which ContentHandler the parser leverages to extract text and metadata (e.g.,

[jira] [Commented] (TIKA-2466) Remove JAXB usage

2017-09-15 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16167672#comment-16167672 ] Nick Burch commented on TIKA-2466: -- Thanks [~rombert]. I'll give it a day or so for people to ponder

[jira] [Commented] (TIKA-2466) Remove JAXB usage

2017-09-15 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16167557#comment-16167557 ] Nick Burch commented on TIKA-2466: -- [~talli...@mitre.org] The methods not being static on {{ParseContext

[jira] [Commented] (TIKA-2462) Add a parser for sas7bdat

2017-09-14 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16167063#comment-16167063 ] Nick Burch commented on TIKA-2462: -- I've just had a quick try with the library, against a test SAS file

[jira] [Comment Edited] (TIKA-2462) Add a parser for sas7bdat

2017-09-14 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16167063#comment-16167063 ] Nick Burch edited comment on TIKA-2462 at 9/14/17 10:37 PM: I've just had

[jira] [Commented] (TIKA-2466) Remove JAXB usage

2017-09-14 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16166949#comment-16166949 ] Nick Burch commented on TIKA-2466: -- If we're going to use {{DocumentBuilderFactory}}, then we need to make

[jira] [Commented] (TIKA-2461) Wordperfect file identified as Quattro Pro document

2017-09-08 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158736#comment-16158736 ] Nick Burch commented on TIKA-2461: -- This may be tricky - I've just tried with our test Quattro Pro 7/8

[jira] [Commented] (TIKA-2460) Possibility to add custom-mimetypes.xml (and/or also other configuration files) from location outside classpath

2017-09-08 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158695#comment-16158695 ] Nick Burch commented on TIKA-2460: -- I'd tweak the comment to {{System property to set a path

[jira] [Commented] (TIKA-2461) Wordperfect file identified as Quattro Pro document

2017-09-08 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158632#comment-16158632 ] Nick Burch commented on TIKA-2461: -- Assuming you have the Tika App jar to hand, you can just run

[jira] [Commented] (TIKA-2461) Wordperfect file identified as Quattro Pro document

2017-09-08 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158539#comment-16158539 ] Nick Burch commented on TIKA-2461: -- Could you try running {{org.apache.poi.poifs.dev.POIFSLister}} against

[jira] [Commented] (TIKA-2460) Possibility to add custom-mimetypes.xml (and/or also other configuration files) from location outside classpath

2017-09-07 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16156703#comment-16156703 ] Nick Burch commented on TIKA-2460: -- For $DAYJOB, we've configured Tomcat to have {{$\{catalina.base

[jira] [Commented] (TIKA-2450) OfficeParser.parse called for zero-byte file with .doc extension

2017-08-30 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147322#comment-16147322 ] Nick Burch commented on TIKA-2450: -- In Windows, right click on a folder, New then Word Document

[jira] [Resolved] (TIKA-2447) PSDParser creates unnecessary large byte array and discards it

2017-08-24 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-2447. -- Resolution: Fixed Fix Version/s: 1.17 > PSDParser creates unnecessary large byte ar

[jira] [Resolved] (TIKA-2445) Windows BAT / CMD detection

2017-08-23 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-2445. -- Resolution: Fixed Fix Version/s: 1.17 Both are now detected as {{application/x-bat}} , which

[jira] [Created] (TIKA-2445) Windows BAT / CMD detection

2017-08-23 Thread Nick Burch (JIRA)
Nick Burch created TIKA-2445: Summary: Windows BAT / CMD detection Key: TIKA-2445 URL: https://issues.apache.org/jira/browse/TIKA-2445 Project: Tika Issue Type: Bug Components: mime

[jira] [Commented] (TIKA-2443) Plain text file identified as rfc822 and which can cause StackOverflowError

2017-08-23 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138374#comment-16138374 ] Nick Burch commented on TIKA-2443: -- Tika doesn't care where you put the file, as long as the classloader

[jira] [Commented] (TIKA-2443) Plain text file identified as rfc822 and which can cause StackOverflowError

2017-08-17 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16130726#comment-16130726 ] Nick Burch commented on TIKA-2443: -- It doesn't matter what priority we put on the Date magic

[jira] [Resolved] (TIKA-1367) Tika documentation should list tika-parsers parser dependencies

2017-08-09 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-1367. -- Resolution: Invalid Glad to hear it's sorted! Based on the stackoverflow post, it's a tricky artifact

[jira] [Commented] (TIKA-2436) Support for GZIP-compressed EMF files

2017-07-29 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16106121#comment-16106121 ] Nick Burch commented on TIKA-2436: -- In a similar way to how we handle WMZ files, I've added a new mime

[jira] [Commented] (TIKA-2433) Tika 1.16 - Nullpointer Exception after update - Asking for help

2017-07-25 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16100166#comment-16100166 ] Nick Burch commented on TIKA-2433: -- As it's in a deprecated part of the codebase, I'm not sure we'd do

[jira] [Resolved] (TIKA-2433) Tika 1.16 - Nullpointer Exception after update - Asking for help

2017-07-25 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-2433. -- Resolution: Fixed Fix Version/s: 1.17 I can reproduce the problem, hopefully fixed

[jira] [Commented] (TIKA-2433) Tika 1.16 - Nullpointer Exception after update - Asking for help

2017-07-25 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16100021#comment-16100021 ] Nick Burch commented on TIKA-2433: -- What are the arguments you are passing to the Tika App? > Tika 1

[jira] [Commented] (TIKA-2042) MBOX file detected wrongly as text/html

2017-07-16 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16089013#comment-16089013 ] Nick Burch commented on TIKA-2042: -- [~mcaruanagalizia] I've added some more rfc822 magic, which I think

[jira] [Commented] (TIKA-2042) MBOX file detected wrongly as text/html

2017-07-13 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085747#comment-16085747 ] Nick Burch commented on TIKA-2042: -- [~mcaruanagalizia] I've added some more patterns

[jira] [Resolved] (TIKA-2422) Improve detection of Graphviz *.dot format

2017-07-06 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-2422. -- Resolution: Fixed Fix Version/s: 1.16 Thanks for this! Patch applied > Improve detect

[jira] [Commented] (TIKA-2399) Version conflict with non-ASL jai-imageio-jpeg2000 and edu.ucar jj2000

2017-07-05 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16074825#comment-16074825 ] Nick Burch commented on TIKA-2399: -- We can properly fix this in 2.x when we sort out how to have multiple

[jira] [Commented] (TIKA-2419) Try HTML mime magic on broken XML files

2017-07-05 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16074806#comment-16074806 ] Nick Burch commented on TIKA-2419: -- One fix might be to drop the priority of the XML magic to 40 to match

[jira] [Created] (TIKA-2419) Try HTML mime magic on broken XML files

2017-07-05 Thread Nick Burch (JIRA)
Nick Burch created TIKA-2419: Summary: Try HTML mime magic on broken XML files Key: TIKA-2419 URL: https://issues.apache.org/jira/browse/TIKA-2419 Project: Tika Issue Type: Bug

[jira] [Commented] (TIKA-2418) English ASCII text classified as video/quicktime

2017-07-05 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16074649#comment-16074649 ] Nick Burch commented on TIKA-2418: -- Hopefully fixed in 0815b2144cf013e1a0803cee72d8076e8c544716 - I've

[jira] [Resolved] (TIKA-2418) English ASCII text classified as video/quicktime

2017-07-05 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-2418. -- Resolution: Fixed Fix Version/s: 1.16 > English ASCII text classified as video/quickt

[jira] [Commented] (TIKA-1367) Tika documentation should list tika-parsers parser dependencies

2017-07-05 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16074648#comment-16074648 ] Nick Burch commented on TIKA-1367: -- [~talli...@mitre.org] I'm not sure there is - we've fixed it in Tika

Re: documenting configuration

2017-07-03 Thread Nick Burch
On Mon, 3 Jul 2017, Allison, Timothy B. wrote: To help a user configure a parameter in the PDFParser, I just started: https://wiki.apache.org/tika/TikaConfig. I realize, though, that I probably should update: https://tika.apache.org/1.15/configuring.html instead. Preferences,

[jira] [Resolved] (TIKA-2409) Tar has different mime type by name vs contents

2017-07-01 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-2409. -- Resolution: Not A Problem > Tar has different mime type by name vs conte

[jira] [Commented] (TIKA-2409) Tar has different mime type by name vs contents

2017-07-01 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16071387#comment-16071387 ] Nick Burch commented on TIKA-2409: -- This is as expected. GTar is a specialisation of tar. Not all tar

[jira] [Commented] (TIKA-2407) Tika crashed while parsing corrupt PDF

2017-06-30 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16070055#comment-16070055 ] Nick Burch commented on TIKA-2407: -- You'd be best off reporting this to the Apache PDFBox project, which

[jira] [Commented] (TIKA-2399) Version conflict with non-ASL jai-imageio-jpeg2000 and edu.ucar jj2000

2017-06-20 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16055588#comment-16055588 ] Nick Burch commented on TIKA-2399: -- The latest gradle has an experimental plugin for generating a maven

[jira] [Commented] (TIKA-2394) Unknown message type: IPM.Note.Rules.OofTemplate.Microsoft

2017-06-15 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16050220#comment-16050220 ] Nick Burch commented on TIKA-2394: -- PST support is provided by libjava-pst. It looks like we're

[jira] [Commented] (TIKA-1945) Powerpoint parser doesn't extract text from diagrams

2017-06-08 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16043571#comment-16043571 ] Nick Burch commented on TIKA-1945: -- I don't know exactly what Tim'll do, but assuming it's similar to what

[jira] [Commented] (TIKA-1945) Powerpoint parser doesn't extract text from diagrams

2017-06-08 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16042865#comment-16042865 ] Nick Burch commented on TIKA-1945: -- A small sample file we can use for unit testing is needed, one per

[jira] [Resolved] (TIKA-2388) Problem in Tika().detect for ODB (Open Office database) files

2017-06-08 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-2388. -- Resolution: Fixed Fix Version/s: 1.16 > Problem in Tika().detect for ODB (Open Office datab

[jira] [Commented] (TIKA-2378) Error extracting text from application/x-msaccess mime type

2017-05-25 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16025536#comment-16025536 ] Nick Burch commented on TIKA-2378: -- This looks to be a bug in Jackcess, the underlying Java library

[jira] [Commented] (TIKA-2376) Avoid org.json dependency

2017-05-25 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16024671#comment-16024671 ] Nick Burch commented on TIKA-2376: -- Tika Parsers already has a dependency on both {{com.googlecode.json

[jira] [Commented] (TIKA-2376) Avoid org.json dependency

2017-05-25 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16024433#comment-16024433 ] Nick Burch commented on TIKA-2376: -- I seem to recall that someone (Ted Dunning perhaps?) has written

Re: Tika App, Extract (-z) and Inline PDF Images?

2017-05-22 Thread Nick Burch
On 2017-05-18 17:02 (-0400), Nick Burch wrote: Hi All I've just been caught out by the Tika App's -z on a PDF not extracting the embedded images. I think we probably shouldn't tweak the default config for the other Tika App modes, but what about extract? Any reason why we sh

RE: Tika 1.15

2017-05-22 Thread Nick Burch
On Mon, 22 May 2017, Allison, Timothy B. wrote: Last I remember, Tyler had some detailed notes...anyone remember where those are? https://wiki.apache.org/tika/ReleaseProcess Nick

Tika App, Extract (-z) and Inline PDF Images?

2017-05-18 Thread Nick Burch
Hi All I've just been caught out by the Tika App's -z on a PDF not extracting the embedded images. I think we probably shouldn't tweak the default config for the other Tika App modes, but what about extract? Any reason why we shouldn't turn on the PDF Parser option "extractInlineImages" when

[jira] [Commented] (TIKA-2372) OSX DMG support

2017-05-18 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16016324#comment-16016324 ] Nick Burch commented on TIKA-2372: -- For a GPL licensed library, catacombae <https://sourceforge.ne

[jira] [Created] (TIKA-2372) OSX DMG support

2017-05-18 Thread Nick Burch (JIRA)
Nick Burch created TIKA-2372: Summary: OSX DMG support Key: TIKA-2372 URL: https://issues.apache.org/jira/browse/TIKA-2372 Project: Tika Issue Type: Improvement Components: parser

[jira] [Commented] (TIKA-2365) Signer's Information doesn't match issue

2017-05-17 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16014065#comment-16014065 ] Nick Burch commented on TIKA-2365: -- Looks like Batik may have inlined some or all of Commons IO. I'd

[jira] [Commented] (TIKA-2362) Skipping Header and Footer data from documents

2017-05-17 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16014059#comment-16014059 ] Nick Burch commented on TIKA-2362: -- Which format(s) are you having that problem with? Is that all

Re: Tika talk next week - help needed!

2017-05-16 Thread Nick Burch
On Tue, 16 May 2017, Eric Pugh wrote: It was great to read through http://events.linuxfoundation.org/sites/events/files/slides/WhatsNewWithApacheTika_1.pdf… Wow there is a lot in Tika. And I think that might be the one challenge with the talk structure, there is SOO much information. The

Tika talk next week - help needed!

2017-05-14 Thread Nick Burch
Hi All Last year in Seville, I gave a talk on Tika entitled "Apache Tika - What’s new with 2.0?". For ApacheCon Miami next week, I've been roped into giving an updated version... https://apachecon2017.sched.com/event/9zvD/apache-tika-whats-new-with-20-nick-burch-apache-software-foun

[jira] [Commented] (TIKA-1867) Tika external parsers cannot be turned off without patching the tika-app-XX.jar

2017-05-10 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16005441#comment-16005441 ] Nick Burch commented on TIKA-1867: -- I've just tried with your config file and the Tika App. I'm seeing

[jira] [Resolved] (TIKA-2353) How to fetch document creator/author/last-modified

2017-05-03 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-2353. -- Resolution: Invalid {{grep}} ? However, please don't use JIRA for asking usage questions. Please direct

[jira] [Commented] (TIKA-2351) Getting error while parsing documents

2017-05-02 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15992896#comment-15992896 ] Nick Burch commented on TIKA-2351: -- Just {{java -jar tika-app-1.15-snapshot.jar --text problem.doc

[jira] [Commented] (TIKA-2351) Getting error while parsing documents

2017-05-02 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15992845#comment-15992845 ] Nick Burch commented on TIKA-2351: -- I've just tried with a recent nightly build, and no error was reported

[jira] [Commented] (TIKA-2351) Getting error while parsing documents

2017-05-02 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15992821#comment-15992821 ] Nick Burch commented on TIKA-2351: -- Can you attach the failing document? If not, could you try grabbing

[jira] [Commented] (TIKA-2346) Allow Office format parsers to exclude parsing shapes

2017-04-28 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15988951#comment-15988951 ] Nick Burch commented on TIKA-2346: -- Thanks Tim! I think we probably don't want it for PPT / PPTX otherwise

[jira] [Resolved] (TIKA-2346) Allow Office format parsers to exclude parsing shapes

2017-04-27 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-2346. -- Resolution: Fixed Implemented in aa4954fb44f707779693faea785acc219739ccd5

[jira] [Created] (TIKA-2346) Allow Office format parsers to exclude parsing shapes

2017-04-27 Thread Nick Burch (JIRA)
Nick Burch created TIKA-2346: Summary: Allow Office format parsers to exclude parsing shapes Key: TIKA-2346 URL: https://issues.apache.org/jira/browse/TIKA-2346 Project: Tika Issue Type

[jira] [Resolved] (TIKA-2345) TikaConfigSerializer should expose EncodingDetector details

2017-04-27 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-2345. -- Resolution: Fixed Implemented Note that ExecutorService still requires serialisation to have a complete

[jira] [Created] (TIKA-2345) TikaConfigSerializer should expose EncodingDetector details

2017-04-27 Thread Nick Burch (JIRA)
Nick Burch created TIKA-2345: Summary: TikaConfigSerializer should expose EncodingDetector details Key: TIKA-2345 URL: https://issues.apache.org/jira/browse/TIKA-2345 Project: Tika Issue Type

[jira] [Commented] (TIKA-2099) Tar files without magic bytes are sporadically detected as text

2017-04-26 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984479#comment-15984479 ] Nick Burch commented on TIKA-2099: -- [~talli...@mitre.org] has been doing some work on Commons Compress

[jira] [Resolved] (TIKA-2327) Turn off MATLAB file parsing

2017-04-16 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-2327. -- Resolution: Information Provided Step 1 - upgrade to a more recent version of Apache Tika Step 2

Re: [Q] reason for tika-parser-*-bundle to be separated from corresponding parser modules in 2.x

2017-03-29 Thread Nick Burch
On Wed, 29 Mar 2017, Konstantin Gribov wrote: I've been surprised by such separation, what was the reason to separate them? I think partly history (we split in 1.x), partly how the split was done (osgi folks amongst the most keen), and partly a desire not to have non-OSGi users getting a

[jira] [Commented] (TIKA-2313) Old Word document (Word 6.0, 1997) has a badly encoded(?) output.

2017-03-28 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945477#comment-15945477 ] Nick Burch commented on TIKA-2313: -- That check may well not be correct for the older formats. I'd start

[jira] [Commented] (TIKA-2313) Old Word document (Word 6.0, 1997) has a badly encoded(?) output.

2017-03-28 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945193#comment-15945193 ] Nick Burch commented on TIKA-2313: -- Opening the document in OpenOffice, it looks to be in French, complete

[jira] [Commented] (TIKA-2311) Create x-tika-ooxml-unk mime type (?)

2017-03-27 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15943238#comment-15943238 ] Nick Burch commented on TIKA-2311: -- How about we have package parser say "if no mimetype set or cu

[jira] [Commented] (TIKA-1772) Mimetype of VTT files

2017-03-22 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15937437#comment-15937437 ] Nick Burch commented on TIKA-1772: -- Thanks for the test file! I've committed it, along with a similar

[jira] [Reopened] (TIKA-2253) Obtain new Miredot license key and upgrade plugin version in tika-server

2017-03-18 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch reopened TIKA-2253: -- Sadly http://tika.apache.org/1.14/miredot/ and friends remain broken. Could someone who understands miredot

[jira] [Commented] (TIKA-2294) Tika inconsistently detects ooxml files as zip file sometimes

2017-03-10 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904849#comment-15904849 ] Nick Burch commented on TIKA-2294: -- That way of calling Tika doesn't pass in the filename, so it'll

[jira] [Commented] (TIKA-2294) Tika inconsistently detects ooxml files as zip file sometimes

2017-03-09 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902906#comment-15902906 ] Nick Burch commented on TIKA-2294: -- To correctly detect the OOXML sub-type, you either need the filename

Re: Require guidance from where to start contributing in Apache Tika

2017-03-08 Thread Nick Burch
On Thu, 9 Mar 2017, Avtar Singh Mehra wrote: I am new to Apache Tika but have plenty of experience with other Apache software like Apache Solr, Apache Lucene, Apache Velocity etc. I would like to start contributing to the Apache Tika community. It would be a great help if someone could guide me

[jira] [Commented] (TIKA-2288) Remove metadata within body-element in OutlookExtractor

2017-03-07 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15900140#comment-15900140 ] Nick Burch commented on TIKA-2288: -- I've got a feeling that this was partly because we didn't have as-good

Re: Tess4j API for TIKA OCR parser

2017-03-07 Thread Nick Burch
On Tue, 7 Mar 2017, Thejan Wijesinghe wrote: I have already used the Tess4j API to rewrite the TesseractOCRParser class. Although it successfully extracts content from most of the file types, it fails some particular unit tests in the TesseractOCRParserTest class. I can solve that. However, I

[jira] [Commented] (TIKA-2271) Tika parsing gives maximum limit reached error

2017-02-22 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878174#comment-15878174 ] Nick Burch commented on TIKA-2271: -- Why are you setting a character limit on your ContentHandler if you

[jira] [Commented] (TIKA-1332) Create "eval" code

2017-02-16 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870398#comment-15870398 ] Nick Burch commented on TIKA-1332: -- Unless we really need a Lucene 6 feature, for now to avoid surprises

[jira] [Commented] (TIKA-1332) Create "eval" code

2017-02-15 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868028#comment-15868028 ] Nick Burch commented on TIKA-1332: -- Apache Ignite seems to use H2, and a google of H2 + apache.org shows

[jira] [Commented] (TIKA-2241) DumpTikaConfigExample generates strange tika-config.xml

2017-01-17 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15826453#comment-15826453 ] Nick Burch commented on TIKA-2241: -- Support added in git in {{320a1f1ede36cf1f62f6f2b8cab468cd78094606

[jira] [Commented] (TIKA-2241) DumpTikaConfigExample generates strange tika-config.xml

2017-01-17 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15826325#comment-15826325 ] Nick Burch commented on TIKA-2241: -- Can you please open a fresh bug for the grobid issue? That's unrelated

[jira] [Commented] (TIKA-2241) DumpTikaConfigExample generates strange tika-config.xml

2017-01-17 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15825786#comment-15825786 ] Nick Burch commented on TIKA-2241: -- To get the list of mime types listed as supported by each parser

[jira] [Commented] (TIKA-2241) DumpTikaConfigExample generates strange tika-config.xml

2017-01-17 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15825729#comment-15825729 ] Nick Burch commented on TIKA-2241: -- You only need to specify a mimetype for a parser if you want to bind

[jira] [Commented] (TIKA-2241) DumpTikaConfigExample generates strange tika-config.xml

2017-01-17 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15825700#comment-15825700 ] Nick Burch commented on TIKA-2241: -- The example is no longer the recommended way to generate or test

[jira] [Comment Edited] (TIKA-2194) matlab files detected as 'text/plain'

2017-01-12 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15820902#comment-15820902 ] Nick Burch edited comment on TIKA-2194 at 1/12/17 12:38 PM: Ah, I've found

[jira] [Commented] (TIKA-2194) matlab files detected as 'text/plain'

2017-01-12 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15820902#comment-15820902 ] Nick Burch commented on TIKA-2194: -- Ah, I've found the problem with your filename case. In the tika

[jira] [Commented] (TIKA-2224) Mime magic for OneNote formats

2016-12-27 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15781782#comment-15781782 ] Nick Burch commented on TIKA-2224: -- They very much are on github! See https://github.com/apache/tika

[jira] [Commented] (TIKA-2224) Mime magic for OneNote formats

2016-12-22 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15771750#comment-15771750 ] Nick Burch commented on TIKA-2224: -- Thanks for the test file, I've added it to git and created a unit test

[jira] [Commented] (TIKA-1946) Add mime detection and parser for WordPerfect

2016-12-21 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15768945#comment-15768945 ] Nick Burch commented on TIKA-1946: -- Ideally different file formats would have different mimetypes

[jira] [Commented] (TIKA-1946) Add mime detection and parser for WordPerfect

2016-12-21 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15768929#comment-15768929 ] Nick Burch commented on TIKA-1946: -- I believe it's only normal to have non-ASF headers for code that we're

[jira] [Commented] (TIKA-2224) Mime magic for OneNote formats

2016-12-21 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15768699#comment-15768699 ] Nick Burch commented on TIKA-2224: -- Mime magic now added for `.one` and `.onetoc`. `.onepkg` is actually

[jira] [Created] (TIKA-2224) Mime magic for OneNote formats

2016-12-21 Thread Nick Burch (JIRA)
Nick Burch created TIKA-2224: Summary: Mime magic for OneNote formats Key: TIKA-2224 URL: https://issues.apache.org/jira/browse/TIKA-2224 Project: Tika Issue Type: Improvement

[jira] [Commented] (TIKA-2208) Catch missing libraires

2016-12-16 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753837#comment-15753837 ] Nick Burch commented on TIKA-2208: -- I wonder if we need to put an extra catch

[jira] [Commented] (TIKA-2208) Catch missing libraires

2016-12-14 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15748104#comment-15748104 ] Nick Burch commented on TIKA-2208: -- Rather than doing it in code, what happens if you specify a Tika

[jira] [Commented] (TIKA-2194) matlab files detected as 'text/plain'

2016-12-11 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15741077#comment-15741077 ] Nick Burch commented on TIKA-2194: -- Matlab files lack a unique magic pattern at the start, which makes

Re: FW: ApacheCon Miami is coming in May.

2016-11-30 Thread Nick Burch
On Wed, 30 Nov 2016, Allison, Timothy B. wrote: ApacheCon and Apache Big Data will be held at the Intercontinental in Miami, Florida, May 16-18, 2017 I plan to attend. Who's in? Any idea if there will be another "content" track like we had in Austin? If we want a Content track, then we'd

[jira] [Commented] (TIKA-2183) Can't Read file if its name is Arabic

2016-11-23 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15690279#comment-15690279 ] Nick Burch commented on TIKA-2183: -- Ping [~chrismattmann] (he's the maintainer of those bindings at https

[jira] [Commented] (TIKA-2183) Can't Read file if its name is Arabic

2016-11-23 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15690022#comment-15690022 ] Nick Burch commented on TIKA-2183: -- How are you calling Tika? I'd guess some sort of Python wrapper? If so
