Re: Copilot license for open source?

2024-04-22 Thread Nick Burch
On Sun, 21 Apr 2024, Michael Wechner wrote: Thanks for the pointer to the Generative Tooling rules, which I was not aware of so far. At the bottom it says that the ASF does not tell developers what tools to use, but I think it would be useful to have some concrete examples, which

Re: Copilot license for open source?

2024-04-21 Thread Nick Burch
On Fri, 19 Apr 2024, Nicholas DiPiazza wrote: Can I get an open source license for GitHub copilot? I've not heard of anyone offering that. Some of the open and open-ish models are quite good on coding tasks, though you'd need to hop to a different interface to ask for help (unlike the

Re: junk cves -- rant

2024-04-12 Thread Nick Burch
On Thu, 11 Apr 2024, Tim Allison wrote: I just excluded joda-time because of this: CVE-2024-23080 https://nvd.nist.gov/vuln/detail/CVE-2024-23080 This is an NPE in joda-time version 2.12.5. That's two versions before the current...is it actually still in there. And more importantly, an NPE is

Re: Document chunking

2024-04-08 Thread Nick Burch
On Mon, 8 Apr 2024, Tim Allison wrote: Not sure we should jump on the bandwagon, but anything we can do to support smart chunking would benefit us. Could just be more integrations with parsers that turn out to be useful. I haven’t had much joy with some. Here’s one that I haven’t evaluated

[jira] [Commented] (TIKA-4223) STL file exported with OpenSCAD not detected correctly

2024-03-26 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830867#comment-17830867 ] Nick Burch commented on TIKA-4223: -- A lot of the early file extension allocations were taken from

[jira] [Commented] (TIKA-4210) Not able to identify tika extension

2024-03-14 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827017#comment-17827017 ] Nick Burch commented on TIKA-4210: -- The attached file seems to be an RTF file. I'm not sure what a "

[jira] [Commented] (TIKA-4208) OOM error in SAS7BDATParser

2024-03-09 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824965#comment-17824965 ] Nick Burch commented on TIKA-4208: -- I would expect that the json output version would need a bit more

[jira] [Commented] (TIKA-4208) OOM error in SAS7BDATParser

2024-03-08 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824874#comment-17824874 ] Nick Burch commented on TIKA-4208: -- How much heap size do you have allocated? The error suggests

[jira] [Commented] (TIKA-3784) Detector returns "application/x-x509-key" when scanning a .p12 file

2024-02-12 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816788#comment-17816788 ] Nick Burch commented on TIKA-3784: -- From [https://datatracker.ietf.org/doc/rfc7292/] it looks l

[jira] [Commented] (TIKA-4148) Support Autodesk Inventor files (.ipt) (.iam) (.ipn) (.idw)

2023-11-19 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17787608#comment-17787608 ] Nick Burch commented on TIKA-4148: -- For detection of the OLE2 based files, we don't need to find unique

[jira] [Updated] (TIKA-4119) Return media type "text/javascript" instead of "application/javascript to follow RFC-9239

2023-09-05 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch updated TIKA-4119: - Component/s: mime > Return media type "text/javascript" instead of "application/javas

[jira] [Updated] (TIKA-4119) Return media type "text/javascript" instead of "application/javascript to follow RFC-9239

2023-09-05 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch updated TIKA-4119: - Labels: tika-3x (was: ) > Return media type "text/javascript" instead of "appl

[jira] [Commented] (TIKA-4119) Return media type "text/javascript" instead of "application/javascript to follow RFC-9239

2023-08-29 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17759921#comment-17759921 ] Nick Burch commented on TIKA-4119: -- I wonder if this is a big enough change around Detection that we

[jira] [Commented] (TIKA-4062) OfflineContentHandler/ContentHandlerDecorator does not provide option for custom error handling

2023-08-02 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17750344#comment-17750344 ] Nick Burch commented on TIKA-4062: -- Between holidays and the length of time needed for regression runs

[jira] [Commented] (TIKA-4064) Update to 2.8.1

2023-07-28 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17748454#comment-17748454 ] Nick Burch commented on TIKA-4064: -- Depends if anyone else on the PMC has the time to be release manager

[jira] [Commented] (TIKA-3948) Require Java 11 in 3.x

2023-07-28 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17748452#comment-17748452 ] Nick Burch commented on TIKA-3948: -- [~solomax] I think the first task is to identify any other areas

[jira] [Commented] (TIKA-4098) Detection fails on PDF with garbage before header

2023-07-10 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17741578#comment-17741578 ] Nick Burch commented on TIKA-4098: -- The more bytes beyond the start we check for the PDF marker, the more
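
The trade-off raised in this comment can be illustrated with a minimal sketch that looks for the `%PDF-` marker only within a limited window from the start of the data. The function name `find_pdf_marker` and the `window` parameter are hypothetical, and this is not how Tika's magic matching is implemented; it only shows that a wider window accepts more files with leading garbage, and also more files that merely happen to contain the marker.

```python
PDF_MARKER = b"%PDF-"


def find_pdf_marker(data: bytes, window: int = 1024) -> int:
    """Return the offset of the PDF marker if it starts within the
    first `window` bytes of `data`, otherwise -1.

    `window` is a hypothetical tuning knob, not a Tika setting.
    """
    # The marker may start at any offset below `window`, so the search
    # range has to extend far enough to hold a marker starting at the
    # last allowed offset.
    limit = min(len(data), window + len(PDF_MARKER) - 1)
    return data.find(PDF_MARKER, 0, limit)
```

With `window=8`, a file with four bytes of junk before the header is still found at offset 4, while one with ten bytes of junk is rejected.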

[jira] [Commented] (TIKA-4060) Add magic to audio/aac in tika-mimetypes.xml

2023-06-08 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17730728#comment-17730728 ] Nick Burch commented on TIKA-4060: -- I'm a muppet... had forgotten to escape the hex characters

[jira] [Resolved] (TIKA-4060) Add magic to audio/aac in tika-mimetypes.xml

2023-06-08 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-4060. -- Fix Version/s: 2.8.1 Resolution: Fixed > Add magic to audio/aac in tika-mimetypes.

[jira] [Commented] (TIKA-4060) Add magic to audio/aac in tika-mimetypes.xml

2023-06-08 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17730649#comment-17730649 ] Nick Burch commented on TIKA-4060: -- 0x494443 is the string ID3, which I think ought to be at the start

[jira] [Commented] (TIKA-4060) Add magic to audio/aac in tika-mimetypes.xml

2023-06-07 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17730304#comment-17730304 ] Nick Burch commented on TIKA-4060: -- I have created some small test AAC files using ffmpeg, and then had

[jira] [Commented] (TIKA-4051) Explore new parsers

2023-06-03 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17728992#comment-17728992 ] Nick Burch commented on TIKA-4051: -- Last time I asked the MPXJ project they weren't interested

[jira] [Commented] (TIKA-3999) audio/xm audio/x-mod

2023-05-23 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725561#comment-17725561 ] Nick Burch commented on TIKA-3999: -- Oh, this brings back memories... good memories :) Unless we can

[jira] [Commented] (TIKA-4045) DBF/MDB row count extraction

2023-05-19 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724302#comment-17724302 ] Nick Burch commented on TIKA-4045: -- I guess this could also apply for other row-based formats like SQLite

[jira] [Commented] (TIKA-4025) Extract frame count from gifs

2023-05-02 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17718674#comment-17718674 ] Nick Burch commented on TIKA-4025: -- Would a video metadata specification's frame count be a better home

Re: idea about creation of accounts

2023-03-13 Thread Nick Burch
On Mon, 13 Mar 2023, Nicholas DiPiazza wrote: can we require that the request form for creating a jira account contains the first issue they would like to create? You'd need to ask on users@infra about that, it's an ASF wide thing (to avoid a huge spam problem) and not something our project

[jira] [Commented] (TIKA-3981) Tika parser meets window system file

2023-02-24 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17693140#comment-17693140 ] Nick Burch commented on TIKA-3981: -- Is this happening for all executables on your machine, or just some

[jira] [Commented] (TIKA-3973) Content of Ogg file with Opus encoded content not correctly recognized

2023-02-15 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689199#comment-17689199 ] Nick Burch commented on TIKA-3973: -- If you only care about container-aware detection for Ogg based

[jira] [Commented] (TIKA-3973) Content of Ogg file with Opus encoded content not correctly recognized

2023-02-15 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689176#comment-17689176 ] Nick Burch commented on TIKA-3973: -- For all container formats you want {{tika-parsers}} or {{tika-parsers

[jira] [Comment Edited] (TIKA-3973) Content of Ogg file with Opus encoded content not correctly recognized

2023-02-15 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689161#comment-17689161 ] Nick Burch edited comment on TIKA-3973 at 2/15/23 2:38 PM: --- For container-based

[jira] [Commented] (TIKA-3973) Content of Ogg file with Opus encoded content not correctly recognized

2023-02-15 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689161#comment-17689161 ] Nick Burch commented on TIKA-3973: -- For container-based detection (such as the Ogg container format), you

[jira] [Commented] (TIKA-3960) PGP encrypted files get detected as application/octet-stream

2023-01-30 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17682352#comment-17682352 ] Nick Burch commented on TIKA-3960: -- If possible, please include a small test file and update {{tika

[jira] [Commented] (TIKA-3703) Consider adding a frictionless data package output format

2023-01-16 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677364#comment-17677364 ] Nick Burch commented on TIKA-3703: -- I guess we could include a data package metadata file to better

[jira] [Commented] (TIKA-3703) Consider adding a frictionless data package output format

2023-01-16 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677326#comment-17677326 ] Nick Burch commented on TIKA-3703: -- A zip file gives you compression, and most clients won't accidentally

[jira] [Commented] (TIKA-3955) separate dependencies from tika-app-2.6.0-noasm-nojson

2023-01-12 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17675914#comment-17675914 ] Nick Burch commented on TIKA-3955: -- The Tika App is intended as a "batteries included" stan

[jira] [Commented] (TIKA-3952) Content mismatch

2023-01-09 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656060#comment-17656060 ] Nick Burch commented on TIKA-3952: -- Is the PDF a scan? Are you doing OCR? > Content misma

[jira] [Commented] (TIKA-3952) Content mismatch

2023-01-09 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656049#comment-17656049 ] Nick Burch commented on TIKA-3952: -- Can you try following the steps in [https://cwiki.apache.org

[jira] [Commented] (TIKA-2536) Move to later edu.ucar version to avoid EOL dependencies

2022-11-02 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17627638#comment-17627638 ] Nick Burch commented on TIKA-2536: -- We can only depend on versions in Maven Central, we can't depend

[jira] [Commented] (TIKA-3890) Identifying an efficient approach for getting page count prior to running an extraction

2022-10-19 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17620633#comment-17620633 ] Nick Burch commented on TIKA-3890: -- DOCX files are compressed XML. Text compresses very well. Already
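
To make the point about compression concrete, here is a small sketch that measures how far repetitive XML-like text shrinks under DEFLATE, the compression method normally used inside ZIP containers such as DOCX. The sample markup and the `compression_ratio` helper are made up for illustration and are not taken from a real document; real files will compress less dramatically than this artificial sample.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Return uncompressed size divided by DEFLATE-compressed size."""
    raw = text.encode("utf-8")
    packed = zlib.compress(raw)
    return len(raw) / len(packed)


# Artificial, highly repetitive WordprocessingML-style markup.
sample = "<w:p><w:r><w:t>Hello world</w:t></w:r></w:p>" * 1000
ratio = compression_ratio(sample)
print(f"raw/compressed ratio: {ratio:.0f}x")
```

The practical consequence is that the size of the file on disk says little about how much XML has to be processed once it is unpacked.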

[jira] [Commented] (TIKA-3890) Identifying an efficient approach for getting page count prior to running an extraction

2022-10-19 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17620610#comment-17620610 ] Nick Burch commented on TIKA-3890: -- The only way to be sure of how many pages are in a Word document

Re: Possibly speeding up tests with Gradle - anyone interested?

2022-10-06 Thread Nick Burch
On Thu, 6 Oct 2022, Tim Allison wrote: Happy to chat. Please put them in touch. Excellent, thanks Tim! Other than your past talks, have we got any info (eg on the wiki?) about how to run the regression corpus? I've been really impressed with what the POI team has done migrating from ant

Re: Possibly speeding up tests with Gradle - anyone interested?

2022-10-06 Thread Nick Burch
On Wed, 5 Oct 2022, Nicholas DiPiazza wrote: Are they offering the Gradle Build Cache stuff free for apache projects? There's an announcement at ApacheCon in about an hour... I think the Infra team are still working out the details on how it'll all work. However, there's an additional offer

Re: Possibly speeding up tests with Gradle - anyone interested?

2022-10-06 Thread Nick Burch
On Wed, 5 Oct 2022, Oleg Tikhonov wrote: Honestly I am trying to port our project to Gradle. But it is not going well. It is a good idea. If some folks can help, we can do it together. Apparently Gradle Enterprise works with both Gradle and Maven! So we don't even have to change our build -

Possibly speeding up tests with Gradle - anyone interested?

2022-10-05 Thread Nick Burch
Hi All At ApacheCon this week, Bob and myself ended up chatting with the folks from Gradle, who are keen to help ASF projects, and are discussing with the Infra team. The easier bit - they think they might be able to help speed up our Maven build, especially the running of tests. Anyone

Re: GUI mods?

2022-09-25 Thread Nick Burch
On Sat, 24 Sep 2022, Tim Allison wrote: Electron and which framework? I'd say there's two choice mechanisms. One is to pick whatever most excites you / is likely to look best on your next funding application, and say that since you're doing most of the initial work you can choose! The

Re: GUI mods?

2022-09-24 Thread Nick Burch
On Sat, 24 Sep 2022, Tim Allison wrote: Given that this is greenfields, should I start w javafx or stick w swing or is there another framework I should try? Give the Tika Server an optional snazzy web UI, then wrap it as an electron app for people who want a native program to start? (plus

RE: Issue related to file mime type detection

2022-09-15 Thread Nick Burch
On Thu, 15 Sep 2022, Sindhu Mahadevappa wrote: We have been looking for the latest Tika 2.4.1 jar file, looks like it is not available anywhere. You can get the Tika App and Tika Server jars for 2.4.1 from https://tika.apache.org/download.html For the core and parser jars, manually

[jira] [Commented] (TIKA-3850) Spanish text is incorrectly detected as Galician

2022-09-13 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17603483#comment-17603483 ] Nick Burch commented on TIKA-3850: -- The kind of statistical language model used in Tika struggles

[jira] [Commented] (TIKA-3308) SVG file without xml declaration tag is detected as text/plain

2022-09-12 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17603038#comment-17603038 ] Nick Burch commented on TIKA-3308: -- Our HTML mime type has both root-XML tags for well-formed documents

Re: Issue related to file mime type detection

2022-09-09 Thread Nick Burch
On Fri, 9 Sep 2022, Sindhu Mahadevappa wrote: We are using tika-parsers 1.23 Tika 1.23 was released in December 2019! You should really use something much more recent for comparing uploaded file mime type from file name as well as from file content for security purpose. Apache Tika's

[jira] [Commented] (TIKA-3832) Required array length is too large (OOM) error when reading a PDF file

2022-08-05 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17575814#comment-17575814 ] Nick Burch commented on TIKA-3832: -- Any chance you could try with Apache PDFBox directly? They've got

[jira] [Resolved] (TIKA-3830) Kaspersky identified a file as riskware

2022-08-03 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-3830. -- Resolution: Duplicate > Kaspersky identified a file as riskw

[jira] [Commented] (TIKA-3829) java.lang.IllegalArgumentException: The document is really a XLS file exception while parsing doc file

2022-08-03 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17574656#comment-17574656 ] Nick Burch commented on TIKA-3829: -- Can you share a file that triggers this bug? The method in question

[jira] [Commented] (TIKA-3814) Extracted text from HTML file does not exclude newline chars from body

2022-07-14 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566991#comment-17566991 ] Nick Burch commented on TIKA-3814: -- I have a feeling that the Text content handler might rely

[jira] [Updated] (TIKA-3814) Extracted text from HTML file does not exclude newline chars from body

2022-07-11 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch updated TIKA-3814: - Priority: Trivial (was: Blocker) > Extracted text from HTML file does not exclude newline chars f

[jira] [Commented] (TIKA-3811) Exclude NameDetector not working for Tika.detect(file)

2022-07-05 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562599#comment-17562599 ] Nick Burch commented on TIKA-3811: -- Maybe [~tallison] has an idea on the config part, he's been working

[jira] [Commented] (TIKA-3811) Exclude NameDetector not working for Tika.detect(file)

2022-07-05 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562537#comment-17562537 ] Nick Burch commented on TIKA-3811: -- You should not be using Apache Tika's detection for anything security

[jira] [Resolved] (TIKA-3810) Vtt file (encoding UTF-8 with BOM) seen as text/plain

2022-07-05 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-3810. -- Fix Version/s: 2.4.2 Resolution: Fixed > Vtt file (encoding UTF-8 with BOM) seen as text/pl

[jira] [Commented] (TIKA-3810) Vtt file (encoding UTF-8 with BOM) seen as text/plain

2022-07-05 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562532#comment-17562532 ] Nick Burch commented on TIKA-3810: -- Looks like we had detection magic for the UTF16 variant BOMs
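
As a rough illustration of the issue in this thread, the sketch below checks for the `WEBVTT` signature while tolerating an optional UTF-8 byte order mark (bytes EF BB BF) in front of it. The function `looks_like_webvtt` is hypothetical; Tika expresses this kind of rule as magic entries in `tika-mimetypes.xml`, not as code like this.

```python
import codecs

WEBVTT_MAGIC = b"WEBVTT"


def looks_like_webvtt(data: bytes) -> bool:
    """Return True if `data` starts with the WEBVTT signature,
    optionally preceded by a UTF-8 byte order mark."""
    # Skip the UTF-8 BOM if present, so the signature comparison
    # starts at the first real character of the file.
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data.startswith(WEBVTT_MAGIC)
```

Without the BOM handling, a file saved as "UTF-8 with BOM" fails the signature check at offset 0 and falls back to a generic text type.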

[jira] [Commented] (TIKA-3809) OutOfMemoryError occurs while reading doc file

2022-07-05 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562484#comment-17562484 ] Nick Burch commented on TIKA-3809: -- If the uncompressed XML is 250mb, then you're going to need a heap

[jira] [Commented] (TIKA-3798) Tika hangs up with some RAR archives

2022-06-22 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557343#comment-17557343 ] Nick Burch commented on TIKA-3798: -- With no file, no thread dump and no stack trace, it won't be easy

[jira] [Commented] (TIKA-3798) Tika hangs up with some RAR archives

2022-06-22 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557319#comment-17557319 ] Nick Burch commented on TIKA-3798: -- Do you have a sample file that shows the problem? A thread dump

[jira] [Commented] (TIKA-3768) message/rfc822 does not include Headers in extracted text

2022-06-09 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552078#comment-17552078 ] Nick Burch commented on TIKA-3768: -- If we can put something into a properly typed + structured metadata

[jira] [Commented] (TIKA-3784) Detector returns "application/x-x509-key" when scanning a .p12 file

2022-06-05 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17550223#comment-17550223 ] Nick Burch commented on TIKA-3784: -- We don't currently have any Mime Magic for PKCS12 files Based

[jira] [Commented] (TIKA-3768) message/rfc822 does not include Headers in extracted text

2022-06-05 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17550216#comment-17550216 ] Nick Burch commented on TIKA-3768: -- I wouldn't expect to find those in the textual content after parsing

[jira] [Commented] (TIKA-3771) Regression from TIKA-3687: Files wrongly detected as EML

2022-05-20 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539993#comment-17539993 ] Nick Burch commented on TIKA-3771: -- The PNG magic is priority 50, which is also what our EML min-match 2

[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-05-19 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539594#comment-17539594 ] Nick Burch commented on TIKA-3710: -- As a "normal" html file wouldn't start with thes

[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-05-19 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539582#comment-17539582 ] Nick Burch commented on TIKA-3710: -- I was thinking we'd do (open)h1(close) or (open)h1(space) to cover
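
One way to read the "(open)h1(close) or (open)h1(space)" idea is as a match on `<h1` followed by either `>` or a space, which accepts tags with attributes but rejects longer tag names that merely begin with `h1`. The sketch below is only an assumption about the intent; the pattern and the name `starts_with_h1` are hypothetical and do not come from Tika.

```python
import re

# "<h1" followed by ">" or a space, case-insensitive.
H1_OPEN = re.compile(rb"<h1[> ]", re.IGNORECASE)


def starts_with_h1(data: bytes) -> bool:
    """Return True if `data` begins with an opening h1 tag."""
    return H1_OPEN.match(data) is not None
```

For example, `<h1>Title</h1>` and `<h1 class="x">` both match, while `<h10>` does not.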

[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-05-18 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538896#comment-17538896 ] Nick Burch commented on TIKA-3710: -- The h1 isn't quite as unique as we might like, and maybe not as good

[jira] [Commented] (TIKA-3571) Add an interface for rendering engines

2022-04-29 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529977#comment-17529977 ] Nick Burch commented on TIKA-3571: -- Some formats support the concept of pages and we can pass that along

[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-29 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529918#comment-17529918 ] Nick Burch commented on TIKA-3742: -- Sure! Potentially easiest is if you create your own fork of Tika

[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-28 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529417#comment-17529417 ] Nick Burch commented on TIKA-3742: -- I believe {{readNBytes}} only came in with Java 9, and the particular

[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-27 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529101#comment-17529101 ] Nick Burch commented on TIKA-3742: -- Assuming we just want type=17 text elements of a DGNv7 file (as per

[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-27 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529038#comment-17529038 ] Nick Burch commented on TIKA-3742: -- In theory you shouldn't need any java code at all if you don't want

[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-27 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529029#comment-17529029 ] Nick Burch commented on TIKA-3742: -- If it can just be run standalone and then {{ExternalParser

[jira] [Commented] (TIKA-3731) Tika CAD DWG reader not pulling meta data from new cad files

2022-04-26 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17528157#comment-17528157 ] Nick Burch commented on TIKA-3731: -- We already do a prefix for several other formats for custom metadata

[jira] [Commented] (TIKA-3719) Tika Server Ability to Run HTTPs

2022-04-24 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527158#comment-17527158 ] Nick Burch commented on TIKA-3719: -- Linux and Mac will need quotes around arguments containing spaces

[jira] [Commented] (TIKA-3721) DGN parser

2022-04-23 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17526776#comment-17526776 ] Nick Burch commented on TIKA-3721: -- We already have a few file types which we send to {{OfficeParser

[jira] [Commented] (TIKA-3721) DGN parser

2022-04-22 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17526352#comment-17526352 ] Nick Burch commented on TIKA-3721: -- The mime types mentioned at [https://communities.bentley.com

[jira] [Commented] (TIKA-3721) DGN parser

2022-04-22 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17526336#comment-17526336 ] Nick Burch commented on TIKA-3721: -- We've had the OK from the author of the tika-dgn-detector I'd

Re: Re-implementing tika-dgn-detector in Tika itself - any objections?

2022-04-22 Thread Nick Burch
, so if it's a problem, feel free to change or ignore. Cheers On Fri, 22 Apr 2022 at 11:57, Nick Burch wrote: Hi Steven Over on https://issues.apache.org/jira/browse/TIKA-3721, one of our users alerted us to your tika-dgn-detector github project. If possible, we'd like to fold the detector logic

Re-implementing tika-dgn-detector in Tika itself - any objections?

2022-04-22 Thread Nick Burch
Hi Steven Over on https://issues.apache.org/jira/browse/TIKA-3721, one of our users alerted us to your tika-dgn-detector github project. If possible, we'd like to fold the detector logic and mime type definitions into Tika itself. (Converting it to Java in the process and putting the

[jira] [Commented] (TIKA-3721) DGN parser

2022-04-22 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17526324#comment-17526324 ] Nick Burch commented on TIKA-3721: -- That detector is written in Kotlin, but should be pretty easy to re

[jira] [Commented] (TIKA-3719) Tika Server Ability to Run HTTPs

2022-04-21 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17525747#comment-17525747 ] Nick Burch commented on TIKA-3719: -- Those look like the steps needed. I'd suggest we create ours

[jira] [Commented] (TIKA-3725) Add Authorization to Tika Server (Suggest Basic to start off with)

2022-04-21 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17525588#comment-17525588 ] Nick Burch commented on TIKA-3725: -- Something like OAuth would be pretty different to basic auth, due

[jira] [Commented] (TIKA-3719) Tika Server Ability to Run HTTPs

2022-04-21 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17525578#comment-17525578 ] Nick Burch commented on TIKA-3719: -- For testing it, I'd be tempted to create a self-signed certificate

[jira] [Commented] (TIKA-3721) DGN parser

2022-04-19 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17524718#comment-17524718 ] Nick Burch commented on TIKA-3721: -- After a quick look, I can't spot any free tools or libraries

[jira] [Commented] (TIKA-3571) Add an interface for rendering engines

2022-04-05 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17517818#comment-17517818 ] Nick Burch commented on TIKA-3571: -- It has been quite a while since I last used jodconverter

[jira] [Commented] (TIKA-3711) Image file names included in parsed Word Document text

2022-04-03 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17516459#comment-17516459 ] Nick Burch commented on TIKA-3711: -- I'd lean towards putting the file name as an attribute of the img tag

[jira] [Commented] (TIKA-3696) Add detection for wacz files

2022-03-10 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17504378#comment-17504378 ] Nick Burch commented on TIKA-3696: -- Shouldn't it be more like {{application/x-wacz}} since it isn't

[jira] [Commented] (TIKA-3684) Extract text returns the text multiple times

2022-03-10 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17504150#comment-17504150 ] Nick Burch commented on TIKA-3684: -- Same as Tika 2.x - pass a {{--config}} flag when you start the server

[jira] [Resolved] (TIKA-3694) Tika Server endpoint to return more details on a mime type

2022-03-07 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-3694. -- Fix Version/s: 2.3.1 Resolution: Fixed > Tika Server endpoint to return more details on a m

[jira] [Commented] (TIKA-3694) Tika Server endpoint to return more details on a mime type

2022-03-07 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17502627#comment-17502627 ] Nick Burch commented on TIKA-3694: -- I've added new HTML and JSON endpoints {{/mime-types/type/subtype

[jira] [Created] (TIKA-3694) Tika Server endpoint to return more details on a mime type

2022-03-07 Thread Nick Burch (Jira)
Nick Burch created TIKA-3694: Summary: Tika Server endpoint to return more details on a mime type Key: TIKA-3694 URL: https://issues.apache.org/jira/browse/TIKA-3694 Project: Tika Issue Type

[jira] [Commented] (TIKA-3686) CSS file detected as JavaScript (application/javascript)

2022-03-03 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500804#comment-17500804 ] Nick Burch commented on TIKA-3686: -- Detecting types of text-based files with magic is always going

[jira] [Commented] (TIKA-3676) Consider making dl4j dependencies provided

2022-02-09 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489597#comment-17489597 ] Nick Burch commented on TIKA-3676: -- As long as we provide sensible instructions on what to do, I'm happy

[jira] [Commented] (TIKA-3656) Tika returns wrong content type for docx types.

2022-01-24 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17480955#comment-17480955 ] Nick Burch commented on TIKA-3656: -- That POM is your problem, you aren't including any of the container

[jira] [Commented] (TIKA-3656) Tika returns wrong content type for docx types.

2022-01-21 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17479981#comment-17479981 ] Nick Burch commented on TIKA-3656: -- How are you calling Tika? And do you have the office parsers on your

[jira] [Commented] (TIKA-3646) MP4 files have their mime type detected as video/quicktime

2022-01-13 Thread Nick Burch (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17475269#comment-17475269 ] Nick Burch commented on TIKA-3646: -- I think this is probably the same issue as TIKA-2935 - the same work

Re: wiki editor access request

2022-01-07 Thread Nick Burch
On Fri, 7 Jan 2022, Josh Burchard wrote: I wrote to Tim about making a small update to https://cwiki.apache.org/confluence/display/TIKA/TikaServerEndpointsCompared and he suggested that I email this dev list to see if someone could grant me editor access. Is that a possibility? Can you sign up
