[jira] [Commented] (TIKA-4223) STL file exported with OpenSCAD not detected correctly

2024-03-26 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830867#comment-17830867 ] Nick Burch commented on TIKA-4223: -- A lot of the early file extension allocations were taken from the

[jira] [Commented] (TIKA-4210) Not able to identify tika extension

2024-03-14 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827017#comment-17827017 ] Nick Burch commented on TIKA-4210: -- The attached file seems to be an RTF file. I'm not sure what a ".mega

[jira] [Commented] (TIKA-4208) OOM error in SAS7BDATParser

2024-03-09 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824965#comment-17824965 ] Nick Burch commented on TIKA-4208: -- I would expect that the json output version would need a bit more

[jira] [Commented] (TIKA-4208) OOM error in SAS7BDATParser

2024-03-08 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824874#comment-17824874 ] Nick Burch commented on TIKA-4208: -- How much heap size do you have allocated? The error suggests that

[jira] [Commented] (TIKA-3784) Detector returns "application/x-x509-key" when scanning a .p12 file

2024-02-12 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816788#comment-17816788 ] Nick Burch commented on TIKA-3784: -- From [https://datatracker.ietf.org/doc/rfc7292/] it looks like

[jira] [Commented] (TIKA-4148) Support Autodesk Inventor files (.ipt) (.iam) (.ipn) (.idw)

2023-11-19 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17787608#comment-17787608 ] Nick Burch commented on TIKA-4148: -- For detection of the OLE2 based files, we don't need to find unique

[jira] [Updated] (TIKA-4119) Return media type "text/javascript" instead of "application/javascript to follow RFC-9239

2023-09-05 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch updated TIKA-4119: - Component/s: mime > Return media type "text/javascript" instead of "application/javascript to > follow

[jira] [Updated] (TIKA-4119) Return media type "text/javascript" instead of "application/javascript to follow RFC-9239

2023-09-05 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch updated TIKA-4119: - Labels: tika-3x (was: ) > Return media type "text/javascript" instead of "application/javascript to >

[jira] [Commented] (TIKA-4119) Return media type "text/javascript" instead of "application/javascript to follow RFC-9239

2023-08-29 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17759921#comment-17759921 ] Nick Burch commented on TIKA-4119: -- I wonder if this is a big enough change around Detection that we

[jira] [Commented] (TIKA-4062) OfflineContentHandler/ContentHandlerDecorator does not provide option for custom error handling

2023-08-02 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17750344#comment-17750344 ] Nick Burch commented on TIKA-4062: -- Between holidays and the length of time needed for regression runs +

[jira] [Commented] (TIKA-4064) Update to 2.8.1

2023-07-28 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17748454#comment-17748454 ] Nick Burch commented on TIKA-4064: -- Depends if anyone else on the PMC has the time to be release manager

[jira] [Commented] (TIKA-3948) Require Java 11 in 3.x

2023-07-28 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17748452#comment-17748452 ] Nick Burch commented on TIKA-3948: -- [~solomax] I think the first task is to identify any other areas of

[jira] [Commented] (TIKA-4098) Detection fails on PDF with garbage before header

2023-07-10 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17741578#comment-17741578 ] Nick Burch commented on TIKA-4098: -- The more bytes beyond the start we check for the PDF marker, the more
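
The comment above is cut off, so the following only illustrates the mechanics it refers to: scanning a limited prefix of a file for the `%PDF-` marker. It is a minimal sketch, not Tika's detector; the class `PdfMarkerScan`, the method `findMarker` and the `window` parameter are hypothetical names chosen for illustration.

```java
import java.nio.charset.StandardCharsets;

final class PdfMarkerScan {
    private static final byte[] MARKER = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /**
     * Returns the offset of the first "%PDF-" that lies entirely within the
     * first {@code window} bytes of {@code data}, or -1 if there is none.
     */
    static int findMarker(byte[] data, int window) {
        // Last start position at which the whole marker still fits in the window.
        int limit = Math.min(data.length, window) - MARKER.length;
        for (int i = 0; i <= limit; i++) {
            boolean match = true;
            for (int j = 0; j < MARKER.length; j++) {
                if (data[i + j] != MARKER[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return i;
            }
        }
        return -1;
    }
}
```

A larger `window` tolerates more leading garbage but also inspects more bytes per file, which is the knob the discussion is about.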

[jira] [Commented] (TIKA-4060) Add magic to audio/aac in tika-mimetypes.xml

2023-06-08 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17730728#comment-17730728 ] Nick Burch commented on TIKA-4060: -- I'm a muppet... had forgotten to escape the hex characters in the

[jira] [Resolved] (TIKA-4060) Add magic to audio/aac in tika-mimetypes.xml

2023-06-08 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-4060. -- Fix Version/s: 2.8.1 Resolution: Fixed > Add magic to audio/aac in tika-mimetypes.xml >

[jira] [Commented] (TIKA-4060) Add magic to audio/aac in tika-mimetypes.xml

2023-06-08 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17730649#comment-17730649 ] Nick Burch commented on TIKA-4060: -- 0x494433 is the string ID3, which I think ought to be at the start.

[jira] [Commented] (TIKA-4060) Add magic to audio/aac in tika-mimetypes.xml

2023-06-07 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17730304#comment-17730304 ] Nick Burch commented on TIKA-4060: -- I have created some small test AAC files using ffmpeg, and then had a

[jira] [Commented] (TIKA-4051) Explore new parsers

2023-06-03 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17728992#comment-17728992 ] Nick Burch commented on TIKA-4051: -- Last time I asked the MPXJ project they weren't interested in

[jira] [Commented] (TIKA-3999) audio/xm audio/x-mod

2023-05-23 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725561#comment-17725561 ] Nick Burch commented on TIKA-3999: -- Oh, this brings back memories... good memories :) Unless we can

[jira] [Commented] (TIKA-4045) DBF/MDB row count extraction

2023-05-19 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724302#comment-17724302 ] Nick Burch commented on TIKA-4045: -- I guess this could also apply for other row-based formats like SQLite

[jira] [Commented] (TIKA-4025) Extract frame count from gifs

2023-05-02 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17718674#comment-17718674 ] Nick Burch commented on TIKA-4025: -- Would a video metadata specification's frame count be a better home?

[jira] [Commented] (TIKA-3981) Tika parser meets window system file

2023-02-24 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17693140#comment-17693140 ] Nick Burch commented on TIKA-3981: -- Is this happening for all executables on your machine, or just some?

[jira] [Commented] (TIKA-3973) Content of Ogg file with Opus encoded content not correctly recognized

2023-02-15 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689199#comment-17689199 ] Nick Burch commented on TIKA-3973: -- If you only care about container-aware detection for Ogg based

[jira] [Commented] (TIKA-3973) Content of Ogg file with Opus encoded content not correctly recognized

2023-02-15 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689176#comment-17689176 ] Nick Burch commented on TIKA-3973: -- For all container formats you want {{tika-parsers}} or

[jira] [Comment Edited] (TIKA-3973) Content of Ogg file with Opus encoded content not correctly recognized

2023-02-15 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689161#comment-17689161 ] Nick Burch edited comment on TIKA-3973 at 2/15/23 2:38 PM: --- For container-based

[jira] [Commented] (TIKA-3973) Content of Ogg file with Opus encoded content not correctly recognized

2023-02-15 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689161#comment-17689161 ] Nick Burch commented on TIKA-3973: -- For container-based detection (such as the Ogg container format), you

[jira] [Commented] (TIKA-3960) PGP encrypted files get detected as application/octet-stream

2023-01-30 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17682352#comment-17682352 ] Nick Burch commented on TIKA-3960: -- If possible, please include a small test file and update

[jira] [Commented] (TIKA-3703) Consider adding a frictionless data package output format

2023-01-16 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677364#comment-17677364 ] Nick Burch commented on TIKA-3703: -- I guess we could include a data package metadata file to better

[jira] [Commented] (TIKA-3703) Consider adding a frictionless data package output format

2023-01-16 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677326#comment-17677326 ] Nick Burch commented on TIKA-3703: -- A zip file gives you compression, and most clients won't accidentally

[jira] [Commented] (TIKA-3955) separate dependencies from tika-app-2.6.0-noasm-nojson

2023-01-12 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17675914#comment-17675914 ] Nick Burch commented on TIKA-3955: -- The Tika App is intended as a "batteries included" standalone app.

[jira] [Commented] (TIKA-3952) Content mismatch

2023-01-09 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656060#comment-17656060 ] Nick Burch commented on TIKA-3952: -- Is the PDF a scan? Are you doing OCR? > Content mismatch >

[jira] [Commented] (TIKA-3952) Content mismatch

2023-01-09 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656049#comment-17656049 ] Nick Burch commented on TIKA-3952: -- Can you try following the steps in

[jira] [Commented] (TIKA-2536) Move to later edu.ucar version to avoid EOL dependencies

2022-11-02 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17627638#comment-17627638 ] Nick Burch commented on TIKA-2536: -- We can only depend on versions in maven central, we can't depend on

[jira] [Commented] (TIKA-3890) Identifying an efficient approach for getting page count prior to running an extraction

2022-10-19 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17620633#comment-17620633 ] Nick Burch commented on TIKA-3890: -- DOCX files are compressed XML. Text compresses very well. Already
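
To give a feel for how well markup-heavy text compresses, here is a minimal sketch using the JDK's `java.util.zip.Deflater` (DEFLATE is the usual compression method inside ZIP packages such as DOCX). The input is synthetic, repetitive XML-like text rather than real DOCX markup, and `XmlCompressionDemo`, `deflatedSize` and `syntheticXml` are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

final class XmlCompressionDemo {
    /** Returns the DEFLATE-compressed size of the input, in bytes. */
    static int deflatedSize(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[8192];
        int total = 0;
        // Drain the compressor; only the byte count matters here.
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    /** Builds synthetic, repetitive XML-like text (an assumption, not real DOCX content). */
    static byte[] syntheticXml(int paragraphs) {
        StringBuilder sb = new StringBuilder("<doc>");
        for (int i = 0; i < paragraphs; i++) {
            sb.append("<p><r><t>Paragraph ").append(i).append("</t></r></p>");
        }
        return sb.append("</doc>").toString().getBytes(StandardCharsets.UTF_8);
    }
}
```

Comparing `syntheticXml(10000).length` with `deflatedSize(syntheticXml(10000))` shows why a small file on disk can expand to a much larger document once unpacked.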

[jira] [Commented] (TIKA-3890) Identifying an efficient approach for getting page count prior to running an extraction

2022-10-19 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17620610#comment-17620610 ] Nick Burch commented on TIKA-3890: -- The only way to be sure of how many pages are in a Word document is

[jira] [Commented] (TIKA-3850) Spanish text is incorrectly detected as Galician

2022-09-13 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17603483#comment-17603483 ] Nick Burch commented on TIKA-3850: -- The kind of statistical language model used in Tika struggles with

[jira] [Commented] (TIKA-3308) SVG file without xml declaration tag is detected as text/plain

2022-09-12 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17603038#comment-17603038 ] Nick Burch commented on TIKA-3308: -- Our HTML mime type has both root-XML tags for well-formed documents,

[jira] [Commented] (TIKA-3832) Required array length is too large (OOM) error when reading a PDF file

2022-08-05 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17575814#comment-17575814 ] Nick Burch commented on TIKA-3832: -- Any chance you could try with Apache PDFBox directly? They've got a

[jira] [Resolved] (TIKA-3830) Kaspersky identified a file as riskware

2022-08-03 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-3830. -- Resolution: Duplicate > Kaspersky identified a file as riskware >

[jira] [Commented] (TIKA-3829) java.lang.IllegalArgumentException: The document is really a XLS file exception while parsing doc file

2022-08-03 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17574656#comment-17574656 ] Nick Burch commented on TIKA-3829: -- Can you share a file that triggers this bug? The method in question

[jira] [Commented] (TIKA-3814) Extracted text from HTML file does not exclude newline chars from body

2022-07-14 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566991#comment-17566991 ] Nick Burch commented on TIKA-3814: -- I have a feeling that the Text content handler might rely on these

[jira] [Updated] (TIKA-3814) Extracted text from HTML file does not exclude newline chars from body

2022-07-11 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch updated TIKA-3814: - Priority: Trivial (was: Blocker) > Extracted text from HTML file does not exclude newline chars from

[jira] [Commented] (TIKA-3811) Exclude NameDetector not working for Tika.detect(file)

2022-07-05 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562599#comment-17562599 ] Nick Burch commented on TIKA-3811: -- Maybe [~tallison] has an idea on the config part, he's been working

[jira] [Commented] (TIKA-3811) Exclude NameDetector not working for Tika.detect(file)

2022-07-05 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562537#comment-17562537 ] Nick Burch commented on TIKA-3811: -- You should not be using Apache Tika's detection for anything security

[jira] [Resolved] (TIKA-3810) Vtt file (encoding UTF-8 with BOM) seen as text/plain

2022-07-05 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-3810. -- Fix Version/s: 2.4.2 Resolution: Fixed > Vtt file (encoding UTF-8 with BOM) seen as text/plain >

[jira] [Commented] (TIKA-3810) Vtt file (encoding UTF-8 with BOM) seen as text/plain

2022-07-05 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562532#comment-17562532 ] Nick Burch commented on TIKA-3810: -- Looks like we had detection magic for the UTF16 variant BOMs but not
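
One way to picture BOM-aware detection is to skip a leading byte order mark before comparing the signature. This is a minimal sketch, assuming the WebVTT signature is the ASCII string `WEBVTT` at the start of the file and handling only the UTF-8 BOM (bytes EF BB BF). It is not how Tika implements the check (Tika's magic lives in tika-mimetypes.xml); `VttSniffer`, `startsWithAt` and `looksLikeVtt` are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

final class VttSniffer {
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};
    private static final byte[] SIGNATURE = "WEBVTT".getBytes(StandardCharsets.US_ASCII);

    /** True if {@code data} contains {@code prefix} starting at {@code offset}. */
    static boolean startsWithAt(byte[] data, int offset, byte[] prefix) {
        if (data.length - offset < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[offset + i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    /** Accepts the signature either at offset 0 or directly after a UTF-8 BOM. */
    static boolean looksLikeVtt(byte[] data) {
        int offset = startsWithAt(data, 0, UTF8_BOM) ? UTF8_BOM.length : 0;
        return startsWithAt(data, offset, SIGNATURE);
    }
}
```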

[jira] [Commented] (TIKA-3809) OutOfMemoryError occurs while reading doc file

2022-07-05 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562484#comment-17562484 ] Nick Burch commented on TIKA-3809: -- If the uncompressed XML is 250mb, then you're going to need a heap a

[jira] [Commented] (TIKA-3798) Tika hangs up with some RAR archives

2022-06-22 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557343#comment-17557343 ] Nick Burch commented on TIKA-3798: -- With no file, no thread dump and no stack trace, it won't be easy to

[jira] [Commented] (TIKA-3798) Tika hangs up with some RAR archives

2022-06-22 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557319#comment-17557319 ] Nick Burch commented on TIKA-3798: -- Do you have a sample file that shows the problem? A thread dump

[jira] [Commented] (TIKA-3768) message/rfc822 does not include Headers in extracted text

2022-06-09 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552078#comment-17552078 ] Nick Burch commented on TIKA-3768: -- If we can put something into a properly typed + structured metadata

[jira] [Commented] (TIKA-3784) Detector returns "application/x-x509-key" when scanning a .p12 file

2022-06-05 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17550223#comment-17550223 ] Nick Burch commented on TIKA-3784: -- We don't currently have any Mime Magic for PKCS12 files Based on

[jira] [Commented] (TIKA-3768) message/rfc822 does not include Headers in extracted text

2022-06-05 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17550216#comment-17550216 ] Nick Burch commented on TIKA-3768: -- I wouldn't expect to find those in the textual content after parsing,

[jira] [Commented] (TIKA-3771) Regression from TIKA-3687: Files wrongly detected as EML

2022-05-20 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539993#comment-17539993 ] Nick Burch commented on TIKA-3771: -- The PNG magic is priority 50, which is also what our EML min-match 2

[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-05-19 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539594#comment-17539594 ] Nick Burch commented on TIKA-3710: -- As a "normal" html file wouldn't start with these snippets, and

[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-05-19 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539582#comment-17539582 ] Nick Burch commented on TIKA-3710: -- I was thinking we'd do (open)h1(close) or (open)h1(space) to cover

[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-05-18 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538896#comment-17538896 ] Nick Burch commented on TIKA-3710: -- The h1 isn't quite as unique as we might like, and maybe not as good

[jira] [Commented] (TIKA-3571) Add an interface for rendering engines

2022-04-29 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529977#comment-17529977 ] Nick Burch commented on TIKA-3571: -- Some formats support the concept of pages and we can pass that along

[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-29 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529918#comment-17529918 ] Nick Burch commented on TIKA-3742: -- Sure! Potentially easiest is if you create your own fork of Tika on

[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-28 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529417#comment-17529417 ] Nick Burch commented on TIKA-3742: -- I believe {{readNBytes}} only came in with Java 9, and the particular
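
For code that still has to build on Java 8, a read loop over `InputStream.read(byte[], int, int)` can stand in for `readNBytes`. The sketch below is a hypothetical helper (`StreamBytes.readExactly`), not code from the DGN parser under discussion. Unlike `readNBytes`, it treats a short read as an error and throws `EOFException`.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

final class StreamBytes {
    /**
     * Reads exactly {@code len} bytes from {@code in} using only Java 8 APIs.
     * Throws EOFException if the stream ends before {@code len} bytes arrive.
     */
    static byte[] readExactly(InputStream in, int len) throws IOException {
        byte[] buf = new byte[len];
        int off = 0;
        // read() may return fewer bytes than requested, so loop until full.
        while (off < len) {
            int n = in.read(buf, off, len - off);
            if (n < 0) {
                throw new EOFException("expected " + len + " bytes, got " + off);
            }
            off += n;
        }
        return buf;
    }
}
```

`java.io.DataInputStream.readFully` offers similar behaviour where wrapping the stream is acceptable.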

[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-27 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529101#comment-17529101 ] Nick Burch commented on TIKA-3742: -- Assuming we just want type=17 text elements of a DGNv7 file (as per

[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-27 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529038#comment-17529038 ] Nick Burch commented on TIKA-3742: -- In theory you shouldn't need any java code at all if you don't want,

[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-27 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529029#comment-17529029 ] Nick Burch commented on TIKA-3742: -- If it can just be run standalone and then {{ExternalParser}} +

[jira] [Commented] (TIKA-3731) Tika CAD DWG reader not pulling meta data from new cad files

2022-04-26 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17528157#comment-17528157 ] Nick Burch commented on TIKA-3731: -- We already do a prefix for several other formats for custom metadata

[jira] [Commented] (TIKA-3719) Tika Server Ability to Run HTTPs

2022-04-24 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527158#comment-17527158 ] Nick Burch commented on TIKA-3719: -- Linux and Mac will need quotes around arguments containing spaces. As

[jira] [Commented] (TIKA-3721) DGN parser

2022-04-23 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17526776#comment-17526776 ] Nick Burch commented on TIKA-3721: -- We already have a few file types which we send to {{OfficeParser}}

[jira] [Commented] (TIKA-3721) DGN parser

2022-04-22 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17526352#comment-17526352 ] Nick Burch commented on TIKA-3721: -- The mime types mentioned at

[jira] [Commented] (TIKA-3721) DGN parser

2022-04-22 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17526336#comment-17526336 ] Nick Burch commented on TIKA-3721: -- We've had the OK from the author of the tika-dgn-detector I'd

[jira] [Commented] (TIKA-3721) DGN parser

2022-04-22 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17526324#comment-17526324 ] Nick Burch commented on TIKA-3721: -- That detector is written in Kotlin, but should be pretty easy to

[jira] [Commented] (TIKA-3719) Tika Server Ability to Run HTTPs

2022-04-21 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17525747#comment-17525747 ] Nick Burch commented on TIKA-3719: -- Those look like the steps needed. I'd suggest we create ours as

[jira] [Commented] (TIKA-3725) Add Authorization to Tika Server (Suggest Basic to start off with)

2022-04-21 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17525588#comment-17525588 ] Nick Burch commented on TIKA-3725: -- Something like OAuth would be pretty different to basic auth, due to

[jira] [Commented] (TIKA-3719) Tika Server Ability to Run HTTPs

2022-04-21 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17525578#comment-17525578 ] Nick Burch commented on TIKA-3719: -- For testing it, I'd be tempted to create a self-signed certificate

[jira] [Commented] (TIKA-3721) DGN parser

2022-04-19 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17524718#comment-17524718 ] Nick Burch commented on TIKA-3721: -- After a quick look, I can't spot any free tools or libraries for

[jira] [Commented] (TIKA-3571) Add an interface for rendering engines

2022-04-05 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17517818#comment-17517818 ] Nick Burch commented on TIKA-3571: -- It has been quite a while since I last used jodconverter, but the

[jira] [Commented] (TIKA-3711) Image file names included in parsed Word Document text

2022-04-03 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17516459#comment-17516459 ] Nick Burch commented on TIKA-3711: -- I'd lean towards putting the file name as an attribute of the img

[jira] [Commented] (TIKA-3696) Add detection for wacz files

2022-03-10 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17504378#comment-17504378 ] Nick Burch commented on TIKA-3696: -- Shouldn't it be more like {{application/x-wacz}} since it isn't a

[jira] [Commented] (TIKA-3684) Extract text returns the text multiple times

2022-03-10 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17504150#comment-17504150 ] Nick Burch commented on TIKA-3684: -- Same as Tika 2.x - pass a {{--config}} flag when you start the server

[jira] [Resolved] (TIKA-3694) Tika Server endpoint to return more details on a mime type

2022-03-07 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-3694. -- Fix Version/s: 2.3.1 Resolution: Fixed > Tika Server endpoint to return more details on a mime

[jira] [Commented] (TIKA-3694) Tika Server endpoint to return more details on a mime type

2022-03-07 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17502627#comment-17502627 ] Nick Burch commented on TIKA-3694: -- I've added new HTML and JSON endpoints {{/mime-types/type/subtype}}

[jira] [Created] (TIKA-3694) Tika Server endpoint to return more details on a mime type

2022-03-07 Thread Nick Burch (Jira)
Nick Burch created TIKA-3694: Summary: Tika Server endpoint to return more details on a mime type Key: TIKA-3694 URL: https://issues.apache.org/jira/browse/TIKA-3694 Project: Tika Issue Type:

[jira] [Commented] (TIKA-3686) CSS file detected as JavaScript (application/javascript)

2022-03-03 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500804#comment-17500804 ] Nick Burch commented on TIKA-3686: -- Detecting types of text-based files with magic is always going to

[jira] [Commented] (TIKA-3676) Consider making dl4j dependencies provided

2022-02-09 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489597#comment-17489597 ] Nick Burch commented on TIKA-3676: -- As long as we provide sensible instructions on what to do, I'm happy

[jira] [Commented] (TIKA-3656) Tika returns wrong content type for docx types.

2022-01-24 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17480955#comment-17480955 ] Nick Burch commented on TIKA-3656: -- That POM is your problem, you aren't including any of the container

[jira] [Commented] (TIKA-3656) Tika returns wrong content type for docx types.

2022-01-21 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17479981#comment-17479981 ] Nick Burch commented on TIKA-3656: -- How are you calling Tika? And do you have the office parsers on your

[jira] [Commented] (TIKA-3646) MP4 files have their mime type detected as video/quicktime

2022-01-13 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17475269#comment-17475269 ] Nick Burch commented on TIKA-3646: -- I think this is probably the same issue as TIKA-2935 - the same work

[jira] [Commented] (TIKA-3590) OSX DMG files wrong MIME type detection (wrong MediaType and Supertype)

2021-11-16 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444644#comment-17444644 ] Nick Burch commented on TIKA-3590: -- [~salmira] Are you able to create us a few sample dmg files to test

[jira] [Commented] (TIKA-3582) Tika does not respect a configuration value passed over a HTTP Header

2021-10-26 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434493#comment-17434493 ] Nick Burch commented on TIKA-3582: -- Bit fiddly, but how about a config option on the server for the

[jira] [Commented] (TIKA-3570) LYR file detection

2021-10-13 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17428246#comment-17428246 ] Nick Burch commented on TIKA-3570: -- [~delmaestro_l] Does that sample file load in the program that

[jira] [Commented] (TIKA-3570) LYR file detection

2021-10-12 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17427920#comment-17427920 ] Nick Burch commented on TIKA-3570: -- Do you have a small sample file that you can share with us, ideally

[jira] [Commented] (TIKA-3559) Add MIME type for .webmanifest files

2021-09-22 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17418531#comment-17418531 ] Nick Burch commented on TIKA-3559: -- I'm not sure if the example in the spec is under a suitable license.

[jira] [Commented] (TIKA-3559) Add MIME type for .webmanifest files

2021-09-22 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17418505#comment-17418505 ] Nick Burch commented on TIKA-3559: -- As we get more JSON-based formats, I wonder if we should do a

[jira] [Commented] (TIKA-3558) vulnerability detected in vorbis-tika-java

2021-09-21 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17418157#comment-17418157 ] Nick Burch commented on TIKA-3558: -- That seems to be a vulnerability in the libflac C code, so shouldn't

[jira] [Commented] (TIKA-3554) Detect plain text file as application/zip based on file ext wrong

2021-09-16 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416044#comment-17416044 ] Nick Burch commented on TIKA-3554: -- Just to emphasise what Tim has written, file type detection in Apache

[jira] [Commented] (TIKA-3554) Detect plain text file as application/zip based on file ext wrong

2021-09-16 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416009#comment-17416009 ] Nick Burch commented on TIKA-3554: -- If possible, wrap your {{InputStream}} as a {{TikaInputStream}}

[jira] [Commented] (TIKA-3555) Eset antivirus found threat in the GitHub repo after Git clone

2021-09-15 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17415782#comment-17415782 ] Nick Burch commented on TIKA-3555: -- Doesn't that make us look more dodgy, and more likely to trigger an

[jira] [Commented] (TIKA-3554) Detect plain text file as application/zip based on file ext wrong

2021-09-15 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17415460#comment-17415460 ] Nick Burch commented on TIKA-3554: -- If you want Apache Tika to do detection only on the file contents

[jira] [Commented] (TIKA-3555) Eset antivirus found threat in the GitHub repo after Git clone

2021-09-15 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17415439#comment-17415439 ] Nick Burch commented on TIKA-3555: -- See TIKA-259. This file will make an underpowered computer unhappy if

[jira] [Commented] (TIKA-3544) Extraction of long sequences of digits from Excel spreadsheets using Tika 1.20 doesn’t yield the expected results

2021-09-08 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17411814#comment-17411814 ] Nick Burch commented on TIKA-3544: -- Apache POI provides the DataFormatter class which attempts to turn

[jira] [Commented] (TIKA-3544) Extraction of long sequences of digits from Excel spreadsheets using Tika 1.20 doesn’t yield the expected results

2021-09-08 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17411774#comment-17411774 ] Nick Burch commented on TIKA-3544: -- You need to be aware that Excel itself only stored numbers-as-numbers

[jira] [Commented] (TIKA-3534) Latest Android Studio will fail building Android project with Tika Core 2.0.0 included - issues with MethodHandle API usage

2021-08-22 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17402788#comment-17402788 ] Nick Burch commented on TIKA-3534: -- This class is used by the bits of Apache Tika (mostly parsers) that

[jira] [Commented] (TIKA-3528) WMV file detected as WMA (audio/x-ms-wma)

2021-08-18 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17400934#comment-17400934 ] Nick Burch commented on TIKA-3528: -- The specification document from Microsoft documents the following
