[jira] [Commented] (TIKA-3159) Macros not extracted from OpenDocument format Office files (flatXML format)

2020-08-12 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17176261#comment-17176261 ] Nick Burch commented on TIKA-3159: -- That wikipedia page states _Office documents that conform

[jira] [Comment Edited] (TIKA-3155) Parse Error while extracting CSV files

2020-08-11 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175415#comment-17175415 ] Nick Burch edited comment on TIKA-3155 at 8/11/20, 9:50 AM: If we can use

[jira] [Commented] (TIKA-3155) Parse Error while extracting CSV files

2020-08-11 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175415#comment-17175415 ] Nick Burch commented on TIKA-3155: -- If we can use quote mode we should, it will make the output from Tika

[jira] [Commented] (TIKA-3153) Text File identified as message/rfc822

2020-08-10 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175043#comment-17175043 ] Nick Burch commented on TIKA-3153: -- We talked about using a regex for simplifying the matching of non

Re: Should we add Apache Commons Lang to tika-core as a dependency?

2020-08-03 Thread Nick Burch
On Mon, 3 Aug 2020, Peter Lee wrote: I've been working on TIKA-3141 recently and pushed a PR on GitHub. As Keith suggested in the PR, maybe we should add Commons Lang to tika-core, as it seems Commons Lang is being used elsewhere in Tika but not in tika-core. Historically, we have tried to keep

[jira] [Commented] (TIKA-3144) Detecting hprof memory dump files exported from Android Studio

2020-07-30 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17167819#comment-17167819 ] Nick Burch commented on TIKA-3144: -- Generally you need to use the {{x-}} prefix on the subtype to mark

[jira] [Commented] (TIKA-3141) LINUX - Tika shouldn't throw an exception for an empty TIKA_CONFIG environment variable value

2020-07-28 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166439#comment-17166439 ] Nick Burch commented on TIKA-3141: -- Unsetting the environment variable seems like the right way to handle

[jira] [Commented] (TIKA-3144) Detecting hprof memory dump files exported from Android Studio

2020-07-28 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166300#comment-17166300 ] Nick Burch commented on TIKA-3144: -- After a quick google, I can't seem to find any canonical or even

[jira] [Commented] (TIKA-3121) Rename master branch

2020-07-09 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17154600#comment-17154600 ] Nick Burch commented on TIKA-3121: -- Don't think so, I think we need to ask infra to make the change

[jira] [Commented] (TIKA-3115) Detect parquet files

2020-07-07 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17153235#comment-17153235 ] Nick Burch commented on TIKA-3115: -- The Avro metadata files seem to be JSON, so not much hope

Re: datasette is live

2020-06-24 Thread Nick Burch
On Wed, 24 Jun 2020, Tim Allison wrote: Thank you, Maruan! I’ll open a ticket w datasette. Would a ProxyPassReverse work for this? Nick

[jira] [Commented] (TIKA-3104) Detection of memgraph files exported from Xcode

2020-06-24 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17143776#comment-17143776 ] Nick Burch commented on TIKA-3104: -- Any chance you could create / find a small XML Memgraph file for us

Re: Request for access to edit the ASF Tika wiki

2020-06-22 Thread Nick Burch
On Mon, 22 Jun 2020, Vegard Stikbakke wrote: I would like to update outdated installation instructions here: https://cwiki.apache.org/confluence/display/TIKA/TikaOCR Specifically, installation on Mac. So I'm kindly requesting access to edit! Can you please create yourself an account on our

Try datasette for browsing corpa sql reports?

2020-06-17 Thread Nick Burch
Hi All, As I understand it (which might be wrong!), Tim is generating a bunch of reports on things in the corpora / how different tools analyse the corpora / how Tika works on the stuff there, mostly as SQL databases. Those databases are then available to anyone who is interested to download and

[jira] [Commented] (TIKA-3113) Currently Tika is detecting a .aux file as text/html

2020-06-15 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17135820#comment-17135820 ] Nick Burch commented on TIKA-3113: -- Any ideas on this scientific-looking format [~lewismc

Re: Corpora server setup

2020-06-15 Thread Nick Burch
On Mon, 15 Jun 2020, Maruan Sahyoun wrote: browsing is now available from https://corpora.tika.apache.org/base/ Let me know what you think or if it doesn't work for you. Is it worth adding a header and/or footer to the auto-index pages, to explain what is there + where to get more details?

[jira] [Commented] (TIKA-3113) Currently Tika is detecting a .aux file as text/html

2020-06-11 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17133944#comment-17133944 ] Nick Burch commented on TIKA-3113: -- I'm not sure what this is, but I'm fairly sure it isn't latex

Mime type magic and repeated similar blocks - thoughts?

2020-06-09 Thread Nick Burch
Hi All, At the moment, to detect RFC822 emails, we try and check for a bunch of common header lines right at the start. If not, we check for a few "could be an unusual header, could be some text", followed by checking for common headers in a larger area of text below. For example, starts
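The two-stage check described in that message can be sketched roughly as follows. This is a minimal illustration in Python, not Tika's actual mime-magic definition: the function name `looks_like_rfc822`, the header subset in `COMMON_HEADERS`, and the 1024-byte window are all assumptions made for the example, and matching is case-sensitive for brevity.

```python
import re

# Illustrative subset of header names often seen at the top of a message
# (an assumption for this sketch, not Tika's actual list).
COMMON_HEADERS = (b"From", b"To", b"Subject", b"Date", b"Received",
                  b"Message-ID", b"MIME-Version", b"Return-Path")

# Any syntactically plausible header line: printable non-colon characters,
# a colon, then a space or tab.
GENERIC_HEADER = re.compile(rb"^[!-9;-~]+:[ \t]")


def looks_like_rfc822(data: bytes, window: int = 1024) -> bool:
    """Two-stage check: strict at offset 0, looser within a larger window."""
    lines = data[:window].splitlines()
    if not lines:
        return False
    # Stage 1: a well-known header right at the start.
    if any(lines[0].startswith(name + b":") for name in COMMON_HEADERS):
        return True
    # Stage 2: the first line must at least look like a header ...
    if not GENERIC_HEADER.match(lines[0]):
        return False
    # ... and a common header must follow before the header block ends.
    for line in lines[1:]:
        if line == b"":
            break  # a blank line ends the header block
        if any(line.startswith(name + b":") for name in COMMON_HEADERS):
            return True
    return False
```

The second stage is what lets a message that opens with less common headers still be recognised, while a text file whose first line merely contains a colon is rejected unless a common header follows.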

Re: Problem in resolving tika parser in Gradle projects

2020-06-05 Thread Nick Burch
On Thu, 4 Jun 2020, Dupinder Singh wrote: My project is Gradle-based, so I was trying to resolve the build as described in your documentation, but this is not resolving the dependency. dependencies { runtime 'org.apache.tika:tika-parsers:1.24.1' } That looks like it ought to be fine,

Re: Fwd: New mailing list queued for creation: corpora-...@tika.apache.org

2020-06-05 Thread Nick Burch
On Thu, 4 Jun 2020, Tim Allison wrote: Following guidance from https://issues.apache.org/jira/browse/INFRA-20376, I've requested a corpora-...@tika.apache.org mail list. If we need separate user/private, we can request those. Let me know. I don't think we need user or private at this stage -

[jira] [Commented] (TIKA-3106) Tika Fails to detect some EML files if extension is not .eml

2020-06-05 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126443#comment-17126443 ] Nick Burch commented on TIKA-3106: -- You ought to be able to point your gradle build at the snapshots repo

[jira] [Commented] (TIKA-3107) AutoDetectParser.parse failed with error "Initialisation of record 0x85(BoundSheetRecord) left 28 bytes remaining still to be read"

2020-06-04 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126324#comment-17126324 ] Nick Burch commented on TIKA-3107: -- This is a bug in Apache POI, one of the libraries that Tika depends

[jira] [Commented] (TIKA-3106) Tika Fails to detect some EML files if extension is not .eml

2020-06-03 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17125545#comment-17125545 ] Nick Burch commented on TIKA-3106: -- This email starts with a series of long {{ARC-}} headers, which means

[jira] [Commented] (TIKA-3104) Detection of memgraph files exported from Xcode

2020-06-03 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17125536#comment-17125536 ] Nick Burch commented on TIKA-3104: -- A mimetype of {{application/x-itunes-bplist}} seems a sensible choice

[jira] [Commented] (TIKA-3104) Detection of memgraph files exported from Xcode

2020-06-03 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124936#comment-17124936 ] Nick Burch commented on TIKA-3104: -- Yup! https://github.com/apache/tika/blob/master/tika-parsers/src

[jira] [Commented] (TIKA-3105) OFT format detection based on file name (extension) instead of file content

2020-06-03 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124934#comment-17124934 ] Nick Burch commented on TIKA-3105: -- At a quick glance, those first 4 bytes aren't unique enough. There look

[jira] [Commented] (TIKA-3104) Detection of memgraph files exported from Xcode

2020-06-02 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124614#comment-17124614 ] Nick Burch commented on TIKA-3104: -- At this point, volunteer-permitting, I think we could now also write

[jira] [Commented] (TIKA-3104) Detection of memgraph files exported from Xcode

2020-05-28 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119272#comment-17119272 ] Nick Burch commented on TIKA-3104: -- There's an unmaintained but suitably licensed bplist parser in Java

[jira] [Updated] (TIKA-3104) Detection of memgraph files exported from Xcode

2020-05-28 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch updated TIKA-3104: - Attachment: memgraph.xml > Detection of memgraph files exported from Xc

[jira] [Commented] (TIKA-3104) Detection of memgraph files exported from Xcode

2020-05-28 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118364#comment-17118364 ] Nick Burch commented on TIKA-3104: -- You can't currently parse the files, only detect them. Parsing

[jira] [Commented] (TIKA-3104) Detection of memgraph files exported from Xcode

2020-05-28 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118359#comment-17118359 ] Nick Burch commented on TIKA-3104: -- {{bplist}} is an Apple file format for storing property listings

[jira] [Commented] (TIKA-3104) Detection of memgraph files exported from Xcode

2020-05-28 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118352#comment-17118352 ] Nick Burch commented on TIKA-3104: -- At some point we might want to add a dedicated bplist detector

[jira] [Commented] (TIKA-3104) Detection of memgraph files exported from Xcode

2020-05-27 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118332#comment-17118332 ] Nick Burch commented on TIKA-3104: -- Looks like these are based off the bplist format. Not sure if we can

[jira] [Commented] (TIKA-2961) Tika wrongly identifies txt documents that start with "caff" as the audio/x-caf audio type

2020-05-17 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17109825#comment-17109825 ] Nick Burch commented on TIKA-2961: -- Based on [https://developer.apple.com/library/archive/documentation

Re: Issue with > 200% CPU after bulk usage

2020-04-16 Thread Nick Burch
On Wed, 15 Apr 2020, hans.mei...@avident-it.se wrote: I have encountered an issue with Tika running locally on a box: the Java runtime goes up to over 200% CPU after running a bulk load of more than 3 million documents over a couple of days. Can you do a thread dump to

[jira] [Commented] (TIKA-3089) Text should be wrapped in pre-tags instead of in p-tags

2020-04-14 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17083096#comment-17083096 ] Nick Burch commented on TIKA-3089: -- Since several parsers need changing... Maybe a new kind of `Config

Re: Tika master branch not building

2020-04-06 Thread Nick Burch
On Mon, 6 Apr 2020, Eric Pugh wrote: Maybe this needs better documentation, however this is a “works as designed” feature! To avoid the build failing, run mvn package -Dossindex.fail=false Should we maybe have this set to false by default, and only enabled on release builds? (We shouldn't

[jira] [Commented] (TIKA-3072) Seeing org.apache.tika.exception.TikaException: Unexpected RuntimeException for an XLS file

2020-03-16 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060277#comment-17060277 ] Nick Burch commented on TIKA-3072: -- I have just tried your file with the latest version of Apache Tika

[jira] [Commented] (TIKA-2714) Tika Parse Errors for certain attachments

2020-03-11 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17057039#comment-17057039 ] Nick Burch commented on TIKA-2714: -- Seems good to me. Since we know the magic for v4, we can add

[jira] [Commented] (TIKA-2714) Tika Parse Errors for certain attachments

2020-03-11 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17057022#comment-17057022 ] Nick Burch commented on TIKA-2714: -- From [https://www.rarlab.com/technote.htm] h3. RAR
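For context on the magic discussed in this thread, here is a small Python sketch of telling the two RAR generations apart by their leading signature. The byte values are the ones given in the RARLAB technote linked above (7 bytes for RAR 4.x, 8 bytes for RAR 5.0). The function name `rar_version` is hypothetical, and the sketch only looks at offset 0, so it makes no attempt to handle archives with a prepended self-extracting module.

```python
from typing import Optional

# Signatures as documented in the RARLAB technote.
RAR4_MAGIC = bytes.fromhex("526172211A0700")    # "Rar!" 1A 07 00
RAR5_MAGIC = bytes.fromhex("526172211A070100")  # "Rar!" 1A 07 01 00


def rar_version(data: bytes) -> Optional[str]:
    """Return "4.x" or "5.0" based on the leading signature, else None."""
    if data.startswith(RAR5_MAGIC):
        return "5.0"
    if data.startswith(RAR4_MAGIC):
        return "4.x"
    return None
```

The two signatures differ at the seventh byte, so neither is a prefix of the other and the order of the checks does not matter.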

[jira] [Commented] (TIKA-3063) Tika parser / POI crash with IndexOutOfBoundsException error

2020-03-08 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17054431#comment-17054431 ] Nick Burch commented on TIKA-3063: -- Based on the error message, it looks like the file is either

[jira] [Commented] (TIKA-3043) vorbis-java-tika overwrites tika's Parser and Detector in MANIFEST

2020-02-13 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17036421#comment-17036421 ] Nick Burch commented on TIKA-3043: -- If you are building an all-in-one jar, you need to merge certain

[jira] [Resolved] (TIKA-3023) Text files starting with MOVI are detected as X-SGI-Movie

2020-02-06 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-3023. -- Fix Version/s: 1.24 Resolution: Fixed > Text files starting with MOVI are detected as X-

[jira] [Commented] (TIKA-3023) Text files starting with MOVI are detected as X-SGI-Movie

2020-02-06 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031507#comment-17031507 ] Nick Burch commented on TIKA-3023: -- The FFMpeg project have some sample SGI Movie files at [https

[jira] [Commented] (TIKA-3037) Tika Docs should highlight Tika-Server

2020-02-05 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17030753#comment-17030753 ] Nick Burch commented on TIKA-3037: -- The website is generated with Maven, source code at https

[jira] [Commented] (TIKA-3034) Detector always returns text/plain when scanning Mathematica files

2020-02-05 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17030686#comment-17030686 ] Nick Burch commented on TIKA-3034: -- We tend to do 3ish releases a year. Last release was in December, so

[jira] [Commented] (TIKA-3034) Detector always returns text/plain when scanning Mathematica files

2020-02-04 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029736#comment-17029736 ] Nick Burch commented on TIKA-3034: -- Mathematica does have a fairly unusual start-of-comment structure, so

[jira] [Commented] (TIKA-3034) Detector always returns text/plain when scanning Mathematica files

2020-01-31 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17027621#comment-17027621 ] Nick Burch commented on TIKA-3034: -- Can you try and pass the filename along with the contents when you

[jira] [Commented] (TIKA-3031) NumberFormatException while parsing a certain PDF document

2020-01-29 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17025810#comment-17025810 ] Nick Burch commented on TIKA-3031: -- This looks like an underlying Apache PDFBox bug to me

[jira] [Commented] (TIKA-3030) XLS files with a root node named WORKBOOK don't get parsed

2020-01-28 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17025586#comment-17025586 ] Nick Burch commented on TIKA-3030: -- Pretty sure we've got a test file in Apache POI like this - some

[jira] [Commented] (TIKA-3028) Failed test at SAS7BDATParserTest:112

2020-01-27 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17024895#comment-17024895 ] Nick Burch commented on TIKA-3028: -- The formatting of the raw values into nice strings is handled

Re: Feature to extract duration of an AMR file

2020-01-27 Thread Nick Burch
On Mon, 27 Jan 2020, Saurabh Bhardwaj wrote: Currently, Tika is able to figure out whether a given file is an AMR file or not, but doesn't return one of the most useful pieces of information for an AMR file, i.e. its duration. Generally that means we have mime-magic for detection, but don't have a parser for

[jira] [Commented] (TIKA-2294) Tika inconsistently detects ooxml files as zip file sometimes

2020-01-19 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018947#comment-17018947 ] Nick Burch commented on TIKA-2294: -- For fully accurate OOXML (and other zip-subtype) detection, you need

[jira] [Commented] (TIKA-3023) Text files starting with MOVI are detected as X-SGI-Movie

2020-01-09 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17011634#comment-17011634 ] Nick Burch commented on TIKA-3023: -- Assuming that the byte after MOVI is part of a version or length

[jira] [Commented] (TIKA-3007) Heic images are detected as "application/mp4" when using tika as server

2019-12-17 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16998213#comment-16998213 ] Nick Burch commented on TIKA-3007: -- See [https://cwiki.apache.org/confluence/display/TIKA

[jira] [Commented] (TIKA-3007) Heic images are detected as "application/mp4" when using tika as server

2019-12-17 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16998112#comment-16998112 ] Nick Burch commented on TIKA-3007: -- There is currently no Parser for HEIC files, only mime detection

[jira] [Commented] (TIKA-3007) Heic images are detected as "application/mp4" when using tika as server

2019-12-13 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16995525#comment-16995525 ] Nick Burch commented on TIKA-3007: -- Mime magic detection is all in Tika Core, so there shouldn't be any

[jira] [Commented] (TIKA-3009) XML Parser reset() detection no working in weblogic 12.2.1.3

2019-12-12 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16994553#comment-16994553 ] Nick Burch commented on TIKA-3009: -- That sounds like a "fun" WebLogic bug... Would calling r

[jira] [Commented] (TIKA-2929) tika-parsers not usable on module path (Java 11)

2019-12-06 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989907#comment-16989907 ] Nick Burch commented on TIKA-2929: -- At the moment, Apache Tika needs to be on the Java Classpath

[jira] [Commented] (TIKA-2912) Add parser for protobufs

2019-12-06 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989856#comment-16989856 ] Nick Burch commented on TIKA-2912: -- See also https://github.com/protobufjs/protobuf.js/wiki/How

[jira] [Commented] (TIKA-2830) Detect Media type of HEIF file correctly

2019-12-05 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16988981#comment-16988981 ] Nick Burch commented on TIKA-2830: -- I think we might have solved some of this with TIKA-2942, would you

Re: Call for Microsoft OneNote experts for help on OneNote parsing in Tika

2019-11-27 Thread Nick Burch
On Sun, 24 Nov 2019, Nicholas DiPiazza wrote: Basically I just need some help understanding some of the finer details of the OneNote format and how to extract info from it. https://stackoverflow.com/questions/59008205/onenote-parsing-how-to-get-to-the-text-blobs-in-the-document

Re: [EXTERNAL] Docker image along with 1.23?

2019-11-21 Thread Nick Burch
On Thu, 21 Nov 2019, Oleg Tikhonov wrote: My question is more pragmatic. What do we put inside the Dockerfile, and which image will it be based on (say Ubuntu)? What will the entrypoint be? Tika Server? Should we "install" a tesseract? Anything more? If we want to be trendy, then Sergey

Re: Docker image along with 1.23?

2019-11-20 Thread Nick Burch
On Wed, 20 Nov 2019, Tim Allison wrote: Eric Pugh recently asked on another channel if we had any plans to release an official docker image for 1.23. Depending on what we put in the container, we do need to be a little careful. There's "platform dependencies" under non-compatible licenses

[jira] [Commented] (TIKA-2992) java.lang.UnsupportedOperationException: This feature requires ASM7 in Tika 1.21

2019-11-20 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16978265#comment-16978265 ] Nick Burch commented on TIKA-2992: -- Most likely you have an older version of ASM on your classpath which

[jira] [Commented] (TIKA-2986) Edge case (?) in file type detection

2019-11-19 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16977661#comment-16977661 ] Nick Burch commented on TIKA-2986: -- Based on the current {{Detector}} and {{DefaultDetector

[jira] [Commented] (TIKA-2986) Edge case (?) in file type detection

2019-11-19 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16977615#comment-16977615 ] Nick Burch commented on TIKA-2986: -- How do we know which ones are a _must_ though? Many we expect

[jira] [Commented] (TIKA-2988) Add mime for alternative fdf format

2019-11-19 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16977613#comment-16977613 ] Nick Burch commented on TIKA-2988: -- If it's not an official one, I believe we're supposed to prefix

[jira] [Commented] (TIKA-2986) Edge case (?) in file type detection

2019-11-19 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16977314#comment-16977314 ] Nick Burch commented on TIKA-2986: -- Maybe we could add a second mode to Detect for this case? Current

[jira] [Commented] (TIKA-2986) Edge case (?) in file type detection

2019-11-18 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976851#comment-16976851 ] Nick Burch commented on TIKA-2986: -- Based on [https://cwiki.apache.org/confluence/display/TIKA

[jira] [Comment Edited] (TIKA-2224) OneNote formats support - Mime Magic and Parser

2019-11-18 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976528#comment-16976528 ] Nick Burch edited comment on TIKA-2224 at 11/18/19 4:22 PM: The Tika Parsers

[jira] [Updated] (TIKA-2224) OneNote formats support - Mime Magic and Parser

2019-11-18 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch updated TIKA-2224: - Summary: OneNote formats support - Mime Magic and Parser (was: Mime magic for OneNote formats

[jira] [Commented] (TIKA-2986) Edge case (?) in file type detection

2019-11-18 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976670#comment-16976670 ] Nick Burch commented on TIKA-2986: -- I seem to recall that we allow the filename only to win for things

[jira] [Resolved] (TIKA-2942) HEIC files are detected as "video/quicktime" media type

2019-11-18 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-2942. -- Fix Version/s: 1.23 Resolution: Fixed > HEIC files are detected as "video/quicktime"

[jira] [Commented] (TIKA-2942) HEIC files are detected as "video/quicktime" media type

2019-11-18 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976609#comment-16976609 ] Nick Burch commented on TIKA-2942: -- Nokia have produced a Java library for the file format - [https

[jira] [Commented] (TIKA-2224) Mime magic for OneNote formats

2019-11-18 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976528#comment-16976528 ] Nick Burch commented on TIKA-2224: -- The Tika Parsers project depends on Guava, currently `28.1-jre` Feel

[jira] [Commented] (TIKA-2982) Tika wrongly identifies encrypted xlsx, docx, and pptx files as doc

2019-11-12 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973054#comment-16973054 ] Nick Burch commented on TIKA-2982: -- If it's at the bottom of the if block, I think it's fine to drop

[jira] [Commented] (TIKA-2942) HEIC files are detected as "video/quicktime" media type

2019-11-11 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16971490#comment-16971490 ] Nick Burch commented on TIKA-2942: -- Do you have a small sample file that you can share with us? We

[jira] [Commented] (TIKA-2972) Allow users to specify a list/map of ContentHandlerFactories in tika-config.xml

2019-11-02 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16965278#comment-16965278 ] Nick Burch commented on TIKA-2972: -- I see the "send the results to a remote network service"

Re: Grant write access to our wiki to Eric Pugh

2019-10-31 Thread Nick Burch
On Wed, 30 Oct 2019, Eric Pugh wrote: I’ve been going through the Wiki a lot over the past three months, and I’d love to go through and clean out/update the old content. Wonderful, thanks! In case you're also feeling keen, the source for the website is

[jira] [Commented] (TIKA-2972) Allow users to specify a list/map of ContentHandlerFactories in tika-config.xml

2019-10-31 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16963956#comment-16963956 ] Nick Burch commented on TIKA-2972: -- It doesn't quite feel like a perfect solution, but I can't think

Re: Grant write access to our wiki to Eric Pugh

2019-10-29 Thread Nick Burch
On Tue, 29 Oct 2019, Tim Allison wrote: Anyone object if I grant write access to our wiki to Eric Pugh? He slacked me a request. I'd almost be tempted to say that we should grant access to all ASF Committers to our wiki. (Note - not all confluence users, as that includes fresh spammy

Re: build failure in master

2019-09-19 Thread Nick Burch
On Wed, 18 Sep 2019, Dan Becker wrote: I am trying to build the master branch from Ubuntu 18.04, but I am getting the following error: [ERROR] Tests run: 11, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.409 s <<< FAILURE! - in org.apache.tika.server.UnpackerResourceTest [ERROR]

Re: Setting eol-style to native on the website files?

2019-09-18 Thread Nick Burch
On Wed, 18 Sep 2019, Tim Allison wrote: I'm good w '\n'. I think the issue is that the mvn tooling might not be if you're on something other than linux/bsd. It seems, as best as I can tell, to create everything in native line endings no matter what the input files are in. (I can't spot any

[jira] [Commented] (TIKA-2947) Following Tika documentation results in a build of Tika version 1.12.

2019-09-18 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932475#comment-16932475 ] Nick Burch commented on TIKA-2947: -- So, it turns out that that page is auto-generated, which is why I

[jira] [Resolved] (TIKA-2947) Following Tika documentation results in a build of Tika version 1.12.

2019-09-18 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-2947. -- Fix Version/s: 1.23 Resolution: Fixed > Following Tika documentation results in a build of T

Setting eol-style to native on the website files?

2019-09-18 Thread Nick Burch
Hi All, I've just done a build of the website for TIKA-2947, and most of the files changed. From a quick look, it seems to just be line endings though. Currently, the source APT files and the output HTML files don't have any line endings set in svn. I'm tempted to set the eol style on all

[jira] [Commented] (TIKA-2947) Following Tika documentation results in a build of Tika version 1.12.

2019-09-18 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932424#comment-16932424 ] Nick Burch commented on TIKA-2947: -- I'd be tempted to update that link (and the same in older versions

Re: revoking signing key

2018-12-04 Thread Nick Burch
On Tue, 4 Dec 2018, Tim Allison wrote: I had to revoke my signing key: EF0CF38A. I have a couple of leads, but if you know of anyone in the Washington, DC region who might be interested in signing my new key (944FFD51), let me know. Send a message to party@ and suggest an after-work Apache

[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-08 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16680404#comment-16680404 ] Nick Burch commented on TIKA-2771: -- I'm not sure we do. We have documents along with the encoding

[jira] [Commented] (TIKA-2765) Regression extracting text from corrupted docx files

2018-10-24 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662467#comment-16662467 ] Nick Burch commented on TIKA-2765: -- Oracle hid all the useful Zip security stuff in recent Java releases

[jira] [Commented] (TIKA-2744) rss+xml doesnt accept files with .xml extension

2018-10-18 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16655229#comment-16655229 ] Nick Burch commented on TIKA-2744: -- Nope, it doesn't work that way. All RSS files are XML files

[jira] [Commented] (TIKA-2744) rss+xml doesnt accept files with .xml extension

2018-10-18 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16655084#comment-16655084 ] Nick Burch commented on TIKA-2744: -- {{application/rss+xml}} is a subtype of {{application/xml}} so

[jira] [Commented] (TIKA-2744) rss+xml doesnt accept files with .xml extension

2018-10-17 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16653808#comment-16653808 ] Nick Burch commented on TIKA-2744: -- I've added a test RSS 2.0 file to Tika's test documents, and it's

[jira] [Commented] (TIKA-2543) No content extraction for application/x-webarchive format

2018-10-17 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16653804#comment-16653804 ] Nick Burch commented on TIKA-2543: -- Great find Tim! Looks like an excellent resource on this. Assuming

[jira] [Commented] (TIKA-2752) Tika-App RTFParser crashes with NullPointerException

2018-10-13 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16648881#comment-16648881 ] Nick Burch commented on TIKA-2752: -- Based on https://wiki.apache.org/tika/ErrorsAndExceptions , I'd say

[jira] [Commented] (TIKA-2747) Expose custom MAPI properties as a result of the OutlookExtractor metadata

2018-10-05 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16639748#comment-16639748 ] Nick Burch commented on TIKA-2747: -- We'll certainly need a sample file with some of these properties

[jira] [Updated] (TIKA-2747) Expose custom MAPI properties as a result of the OutlookExtractor metadata

2018-10-05 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch updated TIKA-2747: - Priority: Minor (was: Blocker) > Expose custom MAPI properties as a result of the OutlookExtrac

[jira] [Commented] (TIKA-2734) Tika addes extra characters at the end of text in extracting from excel file

2018-09-25 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16627544#comment-16627544 ] Nick Burch commented on TIKA-2734: -- That looks like the print page footer, could it be that? > T

Re: 1.19.1?

2018-09-24 Thread Nick Burch
On Mon, 24 Sep 2018, Tim Allison wrote: Aside from the problem with users and non-standard XML parsers, were there any other show-stoppers in POI 4.0.0? Is there a reason to wait for POI 4.0.1? I think, in terms of Tika-affecting bugs, it was the xml parser stuff, and commons compress
