Re: wiki editor access request

2022-01-07 Thread Nick Burch
On Fri, 7 Jan 2022, Josh Burchard wrote: I wrote to Tim about making a small update to https://cwiki.apache.org/confluence/display/TIKA/TikaServerEndpointsCompared and he suggested that I email this dev list to see if someone could grant me editor access. Is that a possibility? Can you sign up

Re: [DISCUSS] upgrading log4j to to log4j2 in Tika's 1.x branch

2021-12-15 Thread Nick Burch
On Wed, 15 Dec 2021, Tim Allison wrote: Sounds good, Nick. Unless there are objections, I'll add an EOL September 30, 2022 for the 1.x branch on our github README and maybe our site somewhere? Maybe just mention it in the news section at the end of any 1.x fix releases? Nick

Re: [DISCUSS] upgrading log4j to to log4j2 in Tika's 1.x branch

2021-12-15 Thread Nick Burch
On Wed, 15 Dec 2021, Tim Allison wrote: I think we should keep the 1.x branch open for security upgrades for a bit...middle of next year? I have _not_ been adding new features or even some bug fixes to 1.x, and I encourage people to migrate to 2.x. We've seen quite a few queries from people

[jira] [Commented] (TIKA-3590) OSX DMG files wrong MIME type detection (wrong MediaType and Supertype)

2021-11-16 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444644#comment-17444644 ] Nick Burch commented on TIKA-3590: -- [~salmira] Are you able to create us a few sample dmg files to test

[jira] [Commented] (TIKA-3582) Tika does not respect a configuration value passed over a HTTP Header

2021-10-26 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434493#comment-17434493 ] Nick Burch commented on TIKA-3582: -- Bit fiddly, but how about a config option on the server

[jira] [Commented] (TIKA-3570) LYR file detection

2021-10-13 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17428246#comment-17428246 ] Nick Burch commented on TIKA-3570: -- [~delmaestro_l] Does that sample file load in the program

[jira] [Commented] (TIKA-3570) LYR file detection

2021-10-12 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17427920#comment-17427920 ] Nick Burch commented on TIKA-3570: -- Do you have a small sample file that you can share with us, ideally

[jira] [Commented] (TIKA-3559) Add MIME type for .webmanifest files

2021-09-22 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17418531#comment-17418531 ] Nick Burch commented on TIKA-3559: -- I'm not sure if the example in the spec is under a suitable license

[jira] [Commented] (TIKA-3559) Add MIME type for .webmanifest files

2021-09-22 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17418505#comment-17418505 ] Nick Burch commented on TIKA-3559: -- As we get more JSON-based formats, I wonder if we should do

[jira] [Commented] (TIKA-3558) vulnerability detected in vorbis-tika-java

2021-09-21 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17418157#comment-17418157 ] Nick Burch commented on TIKA-3558: -- That seems to be a vulnerability in the libflac C code, so shouldn't

[jira] [Commented] (TIKA-3554) Detect plain text file as application/zip based on file ext wrong

2021-09-16 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416044#comment-17416044 ] Nick Burch commented on TIKA-3554: -- Just to emphasise what Tim has written, file type detection in Apache

[jira] [Commented] (TIKA-3554) Detect plain text file as application/zip based on file ext wrong

2021-09-16 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416009#comment-17416009 ] Nick Burch commented on TIKA-3554: -- If possible, wrap your {{InputStream}} as a {{TikaInputStream

[jira] [Commented] (TIKA-3555) Eset antivirus found threat in the GitHub repo after Git clone

2021-09-15 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17415782#comment-17415782 ] Nick Burch commented on TIKA-3555: -- Doesn't that make us look more dodgy, and more likely to trigger

[jira] [Commented] (TIKA-3554) Detect plain text file as application/zip based on file ext wrong

2021-09-15 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17415460#comment-17415460 ] Nick Burch commented on TIKA-3554: -- If you want Apache Tika to do detection only on the file contents

[jira] [Commented] (TIKA-3555) Eset antivirus found threat in the GitHub repo after Git clone

2021-09-15 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17415439#comment-17415439 ] Nick Burch commented on TIKA-3555: -- See TIKA-259 This file will make an underpowered computer unhappy

[jira] [Commented] (TIKA-3544) Extraction of long sequences of digits from Excel spreadsheets using Tika 1.20 doesn’t yield the expected results

2021-09-08 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17411814#comment-17411814 ] Nick Burch commented on TIKA-3544: -- Apache POI provides the DataFormatter class which attempts to turn

[jira] [Commented] (TIKA-3544) Extraction of long sequences of digits from Excel spreadsheets using Tika 1.20 doesn’t yield the expected results

2021-09-08 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17411774#comment-17411774 ] Nick Burch commented on TIKA-3544: -- You need to be aware that Excel itself only stored numbers-as-numbers
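A minimal sketch of why long digit runs may not survive extraction, assuming the cell value is held as an IEEE 754 double, which is how Excel numeric cells are generally stored. The class and method names below are illustrative only and are not part of Tika or POI.

```java
import java.math.BigDecimal;

class LongDigitsDemo {
    // Parse a digit string into a double (as a numeric cell would hold it)
    // and return the exact decimal value that the double actually stores.
    static String throughDouble(String digits) {
        double stored = Double.parseDouble(digits);
        return new BigDecimal(stored).toPlainString();
    }

    public static void main(String[] args) {
        String shortId = "123456789012345";      // 15 digits
        String longId = "12345678901234567890";  // 20 digits
        System.out.println(shortId + " survives: " + shortId.equals(throughDouble(shortId)));
        System.out.println(longId + " survives: " + longId.equals(throughDouble(longId)));
    }
}
```

The 15-digit value round-trips exactly, while the 20-digit value comes back as a nearby but different number, so any later formatting step is already working from altered digits.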

[jira] [Commented] (TIKA-3534) Latest Android Studio will fail building Android project with Tika Core 2.0.0 included - issues with MethodHandle API usage

2021-08-22 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17402788#comment-17402788 ] Nick Burch commented on TIKA-3534: -- This class is used by the bits of Apache Tika (mostly parsers

[jira] [Commented] (TIKA-3528) WMV file detected as WMA (audio/x-ms-wma)

2021-08-18 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17400934#comment-17400934 ] Nick Burch commented on TIKA-3528: -- The specification document from Microsoft documents the following

[jira] [Commented] (TIKA-3528) WMV file detected as WMA (audio/x-ms-wma)

2021-08-18 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17400920#comment-17400920 ] Nick Burch commented on TIKA-3528: -- Currently we detect to the video format based on the overall

Re: versions?

2021-08-11 Thread Nick Burch
On Wed, 11 Aug 2021, Tim Allison wrote: A) I think we should maintain the 1.x branch and continue to put out bug fixes for a bit. Any objections to nominally calling the next release 1.27.1 on JIRA at least? I agree we should probably try to keep 1.x going for at least a few months, to

Kaitai - might be worth trying for new formats

2021-08-09 Thread Nick Burch
Hi All I came across Kaitai - http://kaitai.io/ - yesterday. Based on the experiences documented in this twitter thread on understanding + parsing an embedded filesystem: https://twitter.com/wrongbaud/status/1424380510671880198 Looks like it might be worth a look for if we need to write our

Re: [DISCUSS] Support Elasticsearch in the tika-pipes module?

2021-07-27 Thread Nick Burch
On Mon, 26 Jul 2021, Tim Allison wrote: Currently the OpenSearch emitter works with the 7.x version of Elasticsearch. Going forward, when the projects diverge: a) do we want to support Elasticsearch and I think we should try, but I'm not sure if it should be "we = Apache Tika" or "we = Tika

[jira] [Commented] (TIKA-3496) Dates should have a timezone?

2021-07-23 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17386428#comment-17386428 ] Nick Burch commented on TIKA-3496: -- If there is no timezone stored in the original file, I don't think we

[jira] [Commented] (TIKA-3489) Robots.txt files frequently identified as message/rfc822

2021-07-22 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17385514#comment-17385514 ] Nick Burch commented on TIKA-3489: -- I'm not keen on us throwing away information we can easily return

[jira] [Commented] (TIKA-3466) Cannot detect mimetype of xhtml file when script is first node instead of html

2021-07-09 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17377953#comment-17377953 ] Nick Burch commented on TIKA-3466: -- [~psakkanan] You really need to be doing some xml parsing

[jira] [Commented] (TIKA-3466) Cannot detect mimetype of xhtml file when script is first node instead of html

2021-07-08 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17377317#comment-17377317 ] Nick Burch commented on TIKA-3466: -- I'm happy to add the xmlns version as a match, that seems pretty

[jira] [Commented] (TIKA-3466) Cannot detect mimetype of xhtml file when script is first node instead of html

2021-07-07 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17376689#comment-17376689 ] Nick Burch commented on TIKA-3466: -- I've never seen a file like that before, but I'm sure Tim will pop

[jira] [Commented] (TIKA-3445) Extension reading it as eml instead of txt when headers are not present

2021-06-14 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17363054#comment-17363054 ] Nick Burch commented on TIKA-3445: -- I think that's an email file, Tika thinks that's an email file, seems

[jira] [Commented] (TIKA-3445) Extension reading it as eml instead of txt when headers are not present

2021-06-14 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362930#comment-17362930 ] Nick Burch commented on TIKA-3445: -- This file does seem to be a series of emails. Checking

[jira] [Commented] (TIKA-3431) Using any setting other than AUTO or NO_OCR for X-Tika-PDFOcrStrategy causes remarkable performance loss

2021-06-02 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17355770#comment-17355770 ] Nick Burch commented on TIKA-3431: -- Could this be a PDF where there is a scan + already-OCR'd text

[jira] [Commented] (TIKA-3429) Performance problems partially caused by tika eagerly loading configuration

2021-06-01 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17355185#comment-17355185 ] Nick Burch commented on TIKA-3429: -- Most bits of Tika need the mime entries loading, even if you

[jira] [Commented] (TIKA-3421) Obsoleted mime types

2021-05-26 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17352069#comment-17352069 ] Nick Burch commented on TIKA-3421: -- For the obsolete part, how about we follow the pattern of {{text

[jira] [Commented] (TIKA-3421) Obsoleted mime types

2021-05-26 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17352015#comment-17352015 ] Nick Burch commented on TIKA-3421: -- For the specific case of {{message/news}} I think we probably need

[jira] [Commented] (TIKA-3421) Obsoleted mime types

2021-05-26 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17352009#comment-17352009 ] Nick Burch commented on TIKA-3421: -- If a type used to be used, I think we should keep it in Tika. Though

[jira] [Commented] (TIKA-3411) Add image/jxl

2021-05-21 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17349322#comment-17349322 ] Nick Burch commented on TIKA-3411: -- The current Tika matching logic is: * If we have only a filename

[jira] [Commented] (TIKA-3408) Apache Tika 1.26 Metadata for MP4 and MP3.

2021-05-21 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17349149#comment-17349149 ] Nick Burch commented on TIKA-3408: -- Ah, I wonder if that's a bug in the version of the mp4 library used

[jira] [Commented] (TIKA-3411) Add image/jxl

2021-05-21 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17349123#comment-17349123 ] Nick Burch commented on TIKA-3411: -- The 10 byte magic should be fine, even though it's mostly text

[jira] [Commented] (TIKA-3408) Apache Tika 1.26 Metadata for MP4 and MP3.

2021-05-20 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17348193#comment-17348193 ] Nick Burch commented on TIKA-3408: -- I'm not sure what you mean by an epoch date here, and I can't see any

[jira] [Commented] (TIKA-3409) provide isBinary/isText method

2021-05-19 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17347750#comment-17347750 ] Nick Burch commented on TIKA-3409: -- I'm not sure if we'd want to put this on MediaTypeRegistry

[jira] [Commented] (TIKA-3408) Apache Tika 1.26 Metadata for MP4 and MP3.

2021-05-19 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17347741#comment-17347741 ] Nick Burch commented on TIKA-3408: -- What date do you think is in the MP3 that you aren't getting? The ID3

[jira] [Comment Edited] (TIKA-3409) provide isBinary/isText method

2021-05-19 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17347494#comment-17347494 ] Nick Burch edited comment on TIKA-3409 at 5/19/21, 11:34 AM: - As well

[jira] [Commented] (TIKA-3409) provide isBinary/isText method

2021-05-19 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17347494#comment-17347494 ] Nick Burch commented on TIKA-3409: -- As well as the primary type that Tika detects, also check the aliases

[jira] [Comment Edited] (TIKA-3409) provide isBinary/isText method

2021-05-19 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17347418#comment-17347418 ] Nick Burch edited comment on TIKA-3409 at 5/19/21, 8:47 AM: Do you want

[jira] [Commented] (TIKA-3409) provide isBinary/isText method

2021-05-19 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17347418#comment-17347418 ] Nick Burch commented on TIKA-3409: -- Do you want to know if Apache Tika can parse the file? Or if you

[jira] [Commented] (TIKA-3392) Apache Tika V1.26 doen't work on Android anymore. Issue with org.xml dependencies.

2021-05-12 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17343485#comment-17343485 ] Nick Burch commented on TIKA-3392: -- [~tallison] What about the other Tika "own" XML files

[jira] [Commented] (TIKA-3392) Apache Tika V1.26 doen't work on Android anymore. Issue with org.xml dependencies.

2021-05-11 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342898#comment-17342898 ] Nick Burch commented on TIKA-3392: -- Not sure how easy / possible / user friendly this would be, but... my

[jira] [Commented] (TIKA-3373) add "yml" as extension

2021-04-27 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17333295#comment-17333295 ] Nick Burch commented on TIKA-3373: -- You can't override a built-in type. For now, just grab the updated

[jira] [Commented] (TIKA-3373) add "yml" as extension

2021-04-27 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17333178#comment-17333178 ] Nick Burch commented on TIKA-3373: -- Thanks for that SO post, very helpful to see what people are commonly

[jira] [Commented] (TIKA-3364) PDF Content is extracted twice

2021-04-23 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17330827#comment-17330827 ] Nick Burch commented on TIKA-3364: -- I'm not sure if we already have outlines/bookmarks elsewhere in other

[jira] [Commented] (TIKA-3331) Return a more informative error when trying to parse an encrypted file

2021-03-18 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304115#comment-17304115 ] Nick Burch commented on TIKA-3331: -- Almost certainly a GUI bug from what you describe, but possibly also

[jira] [Commented] (TIKA-3328) PDFs detected as matlab

2021-03-18 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304102#comment-17304102 ] Nick Burch commented on TIKA-3328: -- We give a file starting %PDF a high magic priority, but one starting
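A rough sketch of priority-based magic matching in general, where the highest-priority matching rule wins. The rule set, the priority numbers and the second type name are made up for illustration; this is not Tika's actual detection code or its real mime-type definitions.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

class MagicPrioritySketch {
    // One magic rule: bytes expected at a fixed offset, plus a priority.
    static final class Rule {
        final String type;
        final int offset;
        final byte[] magic;
        final int priority;

        Rule(String type, int offset, byte[] magic, int priority) {
            this.type = type;
            this.offset = offset;
            this.magic = magic;
            this.priority = priority;
        }

        boolean matches(byte[] data) {
            if (data.length < offset + magic.length) return false;
            for (int i = 0; i < magic.length; i++) {
                if (data[offset + i] != magic[i]) return false;
            }
            return true;
        }
    }

    static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    // Hypothetical rules and priorities, for illustration only.
    static final List<Rule> RULES = List.of(
            new Rule("application/pdf", 0, ascii("%PDF-"), 50),
            new Rule("text/x-example-comment", 0, ascii("%"), 20));

    // Return the type of the highest-priority rule that matches, if any.
    static String detect(byte[] data) {
        Rule best = null;
        for (Rule r : RULES) {
            if (r.matches(data) && (best == null || r.priority > best.priority)) {
                best = r;
            }
        }
        return best == null ? "application/octet-stream" : best.type;
    }

    public static void main(String[] args) {
        System.out.println(detect(ascii("%PDF-1.7\n")));
        System.out.println(detect(ascii("% just a comment")));
    }
}
```

In this toy setup a file starting with %PDF- matches both rules and the higher priority decides, whereas a file that only starts with % falls through to the weaker rule.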

[jira] [Commented] (TIKA-3331) Return a more informative error when trying to parse an encrypted file

2021-03-18 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304092#comment-17304092 ] Nick Burch commented on TIKA-3331: -- Many parsers will return [http://tika.apache.org/1.25/api/org/apache

[jira] [Resolved] (TIKA-3310) MP4 video detected as application/mp4

2021-03-14 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-3310. -- Fix Version/s: 1.26 2.0 Resolution: Fixed > MP4 video detected as applicat

[jira] [Commented] (TIKA-3310) MP4 video detected as application/mp4

2021-03-14 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17301280#comment-17301280 ] Nick Burch commented on TIKA-3310: -- Thanks for all your help on this [~peterkronenberg] ! > MP4 vi

[jira] [Commented] (TIKA-3316) Illegal IOException processing XPS files

2021-03-14 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17301278#comment-17301278 ] Nick Burch commented on TIKA-3316: -- Mimetype wise, my view, for what it's worth... It depends on how

[jira] [Commented] (TIKA-3318) MP3 parser using wrong xmpDM:duration units (which aren't clearly documented)

2021-03-14 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17301276#comment-17301276 ] Nick Burch commented on TIKA-3318: -- MP3 parser (+tests) updated

[jira] [Resolved] (TIKA-3318) MP3 parser using wrong xmpDM:duration units (which aren't clearly documented)

2021-03-14 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-3318. -- Fix Version/s: 1.26 2.0 Resolution: Fixed > MP3 parser using wr

[jira] [Created] (TIKA-3318) MP3 parser using wrong xmpDM:duration units (which aren't clearly documented)

2021-03-14 Thread Nick Burch (Jira)
Nick Burch created TIKA-3318: Summary: MP3 parser using wrong xmpDM:duration units (which aren't clearly documented) Key: TIKA-3318 URL: https://issues.apache.org/jira/browse/TIKA-3318 Project: Tika

Re: high level parser module names in 2.x

2021-03-10 Thread Nick Burch
On Tue, 9 Mar 2021, Tim Allison wrote: Would this be better? tika-parsers-basic tika-parsers-complex tika-parsers-¯\_(ツ)_/¯ GStreamer has 4 levels of plugins, Base, Good, Ugly and Bad. Descriptions of what qualifies for what at https://gstreamer.freedesktop.org/modules/ . I can see

[jira] [Commented] (TIKA-3310) MP4 video detected as application/mp4

2021-03-08 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17297714#comment-17297714 ] Nick Burch commented on TIKA-3310: -- Yup, I'm happy with that, thanks for all the work and the revisions

[jira] [Commented] (TIKA-3310) MP4 video detected as application/mp4

2021-03-05 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296103#comment-17296103 ] Nick Burch commented on TIKA-3310: -- I think we need to do the loop twice though, once checking major
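A sketch of one possible reading of the two-pass idea: look at the major brand of the ftyp box first, and only then fall back to the compatible brands. The box layout (size, "ftyp", major brand, minor version, compatible brands) follows the ISO base media file format; the brand table, the fallback type and all names here are assumptions for illustration, not Tika's actual mapping or code.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class FtypBrandSketch {
    // Illustrative brand table only; NOT Tika's actual mapping.
    static final Map<String, String> BRAND_TO_TYPE = Map.of(
            "M4A ", "audio/mp4",
            "3g2c", "video/3gpp2",
            "isom", "video/mp4",
            "mp42", "video/mp4");

    // Read the next four bytes as a four-character code.
    static String fourCc(ByteBuffer buf) {
        byte[] b = new byte[4];
        buf.get(b);
        return new String(b, StandardCharsets.ISO_8859_1);
    }

    static String detect(byte[] header) {
        if (header.length < 16) return "application/octet-stream";
        ByteBuffer buf = ByteBuffer.wrap(header); // big-endian by default
        long boxSize = buf.getInt() & 0xFFFFFFFFL;
        if (!"ftyp".equals(fourCc(buf))) return "application/octet-stream";
        String major = fourCc(buf);
        buf.getInt(); // minor version, unused here
        List<String> compatible = new ArrayList<>();
        long end = Math.min(boxSize, header.length);
        while (buf.position() + 4 <= end) compatible.add(fourCc(buf));

        // First pass: the major brand on its own.
        String type = BRAND_TO_TYPE.get(major);
        if (type != null) return type;
        // Second pass: fall back to the compatible brands, in file order.
        for (String brand : compatible) {
            type = BRAND_TO_TYPE.get(brand);
            if (type != null) return type;
        }
        return "application/mp4"; // generic fallback (assumption)
    }

    // Build a synthetic ftyp box for experiments.
    static byte[] ftyp(String major, String... compatible) {
        ByteBuffer buf = ByteBuffer.allocate(16 + 4 * compatible.length);
        buf.putInt(buf.capacity());
        buf.put("ftyp".getBytes(StandardCharsets.ISO_8859_1));
        buf.put(major.getBytes(StandardCharsets.ISO_8859_1));
        buf.putInt(0);
        for (String c : compatible) buf.put(c.getBytes(StandardCharsets.ISO_8859_1));
        return buf.array();
    }

    public static void main(String[] args) {
        System.out.println(detect(ftyp("3g2c", "isom")));
        System.out.println(detect(ftyp("XXXX", "abcd", "M4A ")));
    }
}
```

Checking the major brand before any compatible brand means a recognised major brand always decides the type, which is the ordering question raised in this thread.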

[jira] [Commented] (TIKA-3310) MP4 video detected as application/mp4

2021-03-05 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296076#comment-17296076 ] Nick Burch commented on TIKA-3310: -- My worry is, though I don't know if it could happen, is eg major=3g2c

FW: OSS-Fuzz integration

2021-03-05 Thread Nick Burch
Hi All. For those who don't follow dev@commons, there's yet another fuzzing tool on the block! Details below. Looks pretty neat, and is now being used on a few Apache Commons projects, including Commons Compress which we use. What do people think about more fuzzing? Worth doing? Or just too

[jira] [Commented] (TIKA-3310) MP4 video detected as application/mp4

2021-03-04 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295342#comment-17295342 ] Nick Burch commented on TIKA-3310: -- Could there be a situation where both a major and a compatible brand

[jira] [Commented] (TIKA-3310) MP4 video detected as application/mp4

2021-03-04 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295333#comment-17295333 ] Nick Burch commented on TIKA-3310: -- FYI There's a few unrelated changes in the pull request, including

[jira] [Commented] (TIKA-3290) Extension reading it as eml instead of txt

2021-02-24 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17290041#comment-17290041 ] Nick Burch commented on TIKA-3290: -- [~Vamsi452] You do appear to have mistaken a free open source project

Re: load error handler in TikaConfig for 2.x?

2021-02-09 Thread Nick Burch
On Tue, 9 Feb 2021, Tim Allison wrote: Would we just swap to throwing an Exception if a parser can't be found / loaded? Y, that'd be my inclination. Seems ok to me what do we do if someone gives us a Tika Config that references a Parser that doesn't exist? My preference would be to throw

Re: load error handler in TikaConfig for 2.x?

2021-02-09 Thread Nick Burch
On Mon, 8 Feb 2021, Tim Allison wrote: Do we still need the LoadErrorHandler for TikaConfig 2.x? IIRC, we added that so that folks who didn't want a dependency could prevent the loading of the dependency and then silence complaints -- if set to ignore. Would we just swap to throwing an

[jira] [Commented] (TIKA-3294) Usage of "ECB" mode for "AES" is insecure

2021-02-05 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279750#comment-17279750 ] Nick Burch commented on TIKA-3294: -- This code is reading something that someone else has already

[jira] [Commented] (TIKA-3290) Extension reading it as eml instead of txt

2021-02-02 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277257#comment-17277257 ] Nick Burch commented on TIKA-3290: -- We did some work fairly recently to increase the chances of real

[jira] [Commented] (TIKA-3290) Extension reading it as eml instead of txt

2021-02-02 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276999#comment-17276999 ] Nick Burch commented on TIKA-3290: -- At first glance, this does seem to be a series of emails, so

[jira] [Commented] (TIKA-3282) OneNote Parser breaks non-ASCII Characters

2021-01-27 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17272774#comment-17272774 ] Nick Burch commented on TIKA-3282: -- Perfect, thanks for checking! [~tallison] any chance you feel like

[jira] [Commented] (TIKA-3282) OneNote Parser breaks non-ASCII Characters

2021-01-27 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17272754#comment-17272754 ] Nick Burch commented on TIKA-3282: -- Thanks for this patch and test file If you have a copy of OneNote

Re: site?

2021-01-18 Thread Nick Burch
On Mon, 18 Jan 2021, Tim Allison wrote: I did only minimal updates to our site so that there's still mostly info about 1.25, javadocs, etc. are still 1.25. I want to make it clear that that is the "production" release. If desired, I can do the full suite of updates for 2.0.0-ALPHA. Let me

[jira] [Commented] (TIKA-3274) Tika 2.0.0 -- Move parser specific metadata out of tika-core to parser modules

2021-01-14 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17264705#comment-17264705 ] Nick Burch commented on TIKA-3274: -- Two possible issues with the move spring to mind: * It becomes

Re: datasette configuration fixed

2021-01-12 Thread Nick Burch
On Mon, 11 Jan 2021, Tim Allison wrote: Thanks to a recommendation from a user and the developer of datasette, I configured the proxy correctly so that this now works: https://corpora.tika.apache.org/datasette/ Yey, thanks for tracking that down and getting to the fix! Nick

[jira] [Commented] (TIKA-3267) Method getEnableImageProcessing() in TesseractOCRConfig should be renamed

2021-01-07 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17260705#comment-17260705 ] Nick Burch commented on TIKA-3267: -- I have a feeling this may be due to the magic reflection-based stuff

[jira] [Commented] (TIKA-3258) Run OCR on PDFs with 'auto' mode as default in Tika 2.0.0

2021-01-07 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17260440#comment-17260440 ] Nick Burch commented on TIKA-3258: -- I can see beginner users, especially non-Java ones using Tika via

[jira] [Commented] (TIKA-3260) Update rotation.py to work with python3 and a more modern matplotlib

2021-01-05 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17258822#comment-17258822 ] Nick Burch commented on TIKA-3260: -- If we can make a script that's valid python 2 + 3, that'd be ideal

[jira] [Commented] (TIKA-3255) Parsing MP3 file with record size > 100000 fails

2020-12-22 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17253643#comment-17253643 ] Nick Burch commented on TIKA-3255: -- The 6 MB MP3 file seems to be 2.75 MB of ID3 tags, which seems pretty

[jira] [Commented] (TIKA-3254) Html font styles missing - doc to html

2020-12-22 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17253511#comment-17253511 ] Nick Burch commented on TIKA-3254: -- Tika tries to give you clean, semantically meaningful XHTML

Datasette instance problems when proxied to main site?

2020-12-08 Thread Nick Burch
Hi All I'm having some issues with the datasette instance on the vm. The main table pages are working, but csv/json/queries seem to be giving a 404. Happy - https://corpora.tika.apache.org/datasette/file_profiles/file_profiles Unhappy -

Re: xmpDM:duration - units?

2020-11-30 Thread Nick Burch
On Thu, 19 Nov 2020, Nick Burch wrote: On Thu, 19 Nov 2020, Tim Allison wrote: Looks like 'scale' needs to be taken into consideration? See 1.2.6.9 https://www.adobe.com/content/dam/acom/en/devnet/xmp/pdfs/XMPSDKReleasecc-2020/XMPSpecificationPart2.pdf Ah, yes, check the spec! 1.2.6.5

Re: Tika 2.0.0-ALPHA?

2020-11-30 Thread Nick Burch
On Mon, 30 Nov 2020, Tim Allison wrote: Now that 1.25 is released, I'm going to work on refactoring tika-eval and tika-server shortly. Then add back in the osgi bundle. After that, shall we go with 2.0.0-ALPHA? Seems ok to me, assuming you're happy to do the work! :) Thanks Nick

Re: xmpDM:duration - units?

2020-11-19 Thread Nick Burch
, even though it's a breaking change. Thoughts? Nick On Wed, Nov 18, 2020 at 3:26 PM Nick Burch wrote: Hi All This question prompted by https://stackoverflow.com/q/64888488/685641 Is there / should there be fixed units on the xmpDM:duration metadata property? And if so, what? Currently

xmpDM:duration - units?

2020-11-18 Thread Nick Burch
Hi All This question prompted by https://stackoverflow.com/q/64888488/685641 Is there / should there be fixed units on the xmpDM:duration metadata property? And if so, what? Currently, MP3 seems to use milliseconds via

[jira] [Commented] (TIKA-1735) Unsupported AutoCAD drawing version: AC1027

2020-11-10 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17229349#comment-17229349 ] Nick Burch commented on TIKA-1735: -- It has been a while since I last looked at this parser, and I'd

[jira] [Commented] (TIKA-3218) Wrong comment for method sortLoadedClasses in ServiceLoaderUtils

2020-11-05 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17226820#comment-17226820 ] Nick Burch commented on TIKA-3218: -- I think the idea of this was so that eg Parsers would have user

[jira] [Commented] (TIKA-3211) Junrar does not support Rar5, 7-Zip-JBinding does, so how about implement RarParser using 7-Zip-JBinding?

2020-10-22 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218859#comment-17218859 ] Nick Burch commented on TIKA-3211: -- 7zip is mostly LGPL, so we wouldn't be able to include

Re: Tika 1.25 release date?

2020-10-21 Thread Nick Burch
On Wed, 21 Oct 2020, Alexander Klimetschek wrote: Regarding xmpcore: I would love to help but it's a different department :-) If you can use internal contacts to find the people we need to prod / lobby / smile at, that'd be a big help! And/or if you can try to bribe that team with sending

[jira] [Commented] (TIKA-3209) Different between PictureRunMapper in POI and PicturesSource in Tika

2020-10-19 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17216827#comment-17216827 ] Nick Burch commented on TIKA-3209: -- I've taken a look at the code in POI and Tika today, and back when

Re: Fwd: XLSX wrapped in an OLE2 CompObj/Package - should WorkbookFactory handle it?

2020-10-13 Thread Nick Burch
On Tue, 13 Oct 2020, Tim Allison wrote: Ha, y, this file exercises those bits of code: https://github.com/apache/tika/blob/main/tika-parser-modules/tika-parser-microsoft-module/src/test/resources/test-documents/testPPT_oleWorkbook.ppt Nick, does this match the features of the SO question?

Re: Fwd: XLSX wrapped in an OLE2 CompObj/Package - should WorkbookFactory handle it?

2020-10-10 Thread Nick Burch
On Fri, 9 Oct 2020, Tim Allison wrote: Do you think we should follow up on the Tika side? Do we know if we can handle this? I thought we did, but checking POIFSContainerDetector I can't actually see that case covered I think we (Tika) can handle it in a similar way to CompObj Over on

Expected private/secret keys in the source (TIKA-3205)

2020-09-29 Thread Nick Burch
Hey All Just a quick heads-up that for TIKA-3205 I generated a few new small private keys (RSA, DSA, EC) and added them to the parser test documents folder, for unit testing the new mime magics for keys and certificates. They're not protecting or using anything. One automated security

[jira] [Commented] (TIKA-3205) Mime magic for more certificate related formats

2020-09-29 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17204054#comment-17204054 ] Nick Burch commented on TIKA-3205: -- Magic added for PEM and DER encoded certificates, and public/private

[jira] [Commented] (TIKA-3195) Inconsistent result of tika.detect(InputStream) and tika.detect(TikaInputStream)

2020-09-14 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195296#comment-17195296 ] Nick Burch commented on TIKA-3195: -- Currently, Tika has container-based detection for OLE2, Zip, Ogg

[jira] [Commented] (TIKA-3195) Inconsistent result of tika.detect(InputStream) and tika.detect(TikaInputStream)

2020-09-11 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17194176#comment-17194176 ] Nick Burch commented on TIKA-3195: -- We need to turn those non-stream types into an InputStream, and use

[jira] [Commented] (TIKA-3195) Inconsistent result of tika.detect(InputStream) and tika.detect(TikaInputStream)

2020-09-11 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17194067#comment-17194067 ] Nick Burch commented on TIKA-3195: -- This is expected behaviour. Ogg is a container format. It isn't

[jira] [Commented] (TIKA-3193) Add mime detection for avif

2020-09-08 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192276#comment-17192276 ] Nick Burch commented on TIKA-3193: -- That is one interesting blog post! Probably the best I've come across
