RE: Very slow parsing of a few PDF files

2017-11-21 Thread Nick Burch
On Tue, 21 Nov 2017, Jim Idle wrote: Following up on this, I will try cancelling my thread based tasks after a pre-set time limit. That is only going to work if Tika and the underlying parsers behave correctly with the interrupted exception. Anyone had any success with that? I am mainly looking
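
A minimal sketch of the time-limit approach described in the quoted question, using only the JDK. The class name TimedTask and the sleeping task are hypothetical stand-ins for a real parse call. As the question notes, cancellation only helps if the running code responds to interruption.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class TimedTask {
    /** Runs the task and returns its result, or null if it exceeds the time limit. */
    static String runWithTimeout(Callable<String> task, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<String> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // cancel(true) interrupts the worker thread; code that never
            // checks for interruption will keep running regardless.
            future.cancel(true);
            return null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            future.cancel(true);
            return null;
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Hypothetical stand-ins for a fast and a very slow parse.
        Callable<String> fast = () -> "done";
        Callable<String> slow = () -> { Thread.sleep(5_000); return "late"; };
        System.out.println(runWithTimeout(fast, 1_000));
        System.out.println(runWithTimeout(slow, 100));
    }
}
```

Thread.sleep responds to interruption, so the slow task here stops promptly; a parser stuck in a tight loop would not.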

Re: Binary file check

2018-01-11 Thread Nick Burch
On Thu, 11 Jan 2018, Kudrettin Güleryüz wrote: Does Tika library provide an efficient binary file check? How do you define "binary"? Only things with a mimetype that starts text/ ? Or do you want to include application/xml files? Or things that extend from XML like DIF and FictionBook? Only

Re: Parse file without creating tmp file

2018-01-14 Thread Nick Burch
On Fri, 12 Jan 2018, Luís Filipe Nassif wrote: I can list some of them currently needing temp files: jpeg, zip (for detection) and derived (docx, xlsx, pptx), ole2 (for detection) and derived (doc, xls, ppt), mdb, pst, rar, 7zip, sqlite... I've had a go at recording this in the Wiki, along with

Re: How to implement an InputStream that dynamically guesses the extension of a file that is streamed using Apache Tika?

2018-01-14 Thread Nick Burch
On Fri, 12 Jan 2018, Martin Todorov wrote: We're working on implementing a new artifact repository manager. Most of the files in the repositories will be binaries (usually archives such as jar, war, ear, zip, tar, tar.bz2, tar.gz, but not necessarily, or limited to just these). Unfortunately, y

Re: Binary file check

2018-01-14 Thread Nick Burch
On Thu, 11 Jan 2018, Kudrettin Güleryüz wrote: I am not an expert on mime types and how they extend. My definition of binary is any file that is not in human readable form. Any other file, I'd like to index. Would that answer your question? Some of us humans here can read a wide range of form

Re: Binary file check

2018-01-21 Thread Nick Burch
On Fri, 19 Jan 2018, Kudrettin Güleryüz wrote: One more thing, regarding application/xml vs text/xml I think I'll skip application/xml for now and just include text/xml Assuming application/xml is compressed XML such as Open office documents and text/xml as uncompressed XML Nope! They're both

Re: Detect JSON / PDF specific mime type

2018-02-05 Thread Nick Burch
On Mon, 5 Feb 2018, Matteo Alessandroni wrote: I'm using Apache Tika to detect a file Mime Type from its base64 representation. Unfortunately I don't have other info about the file (e.g. extension). and it gives me "text/plain" for JSON and PDF files, but I would like to obtain a more specifi

Re: Long time with OCR

2018-02-20 Thread Nick Burch
On Mon, 19 Feb 2018, Mark Kerzner wrote: Is that a good approach? Is the 10 seconds time normal? I am using the latest most powerful Mac and I get similar results on an i7 processor in Ubuntu. Tika uses the open source Tesseract OCR engine. Tesseract is optimised for ease of contributions and

Re: Malware RTF is not detected as RTF

2018-03-01 Thread Nick Burch
On Thu, 1 Mar 2018, Jim Idle wrote: Malicious RTF files take advantage of the fact that Microsoft do not follow their own RTF spec. Specifically, Word et al only looks for the opening sequence: {rt Though the spec says it should be: {rtf1 I don't think that Tika can assume that all RTF us

Re: Unable to use -classpath

2018-03-05 Thread Nick Burch
On Sat, 3 Mar 2018, Jean-Nicolas Boulay Desjardins wrote: I am using this command: java -classpath /home/$USER/Projects/Lab/tika/classes/ -jar ./tika-app/target/tika-app-1.17.jar Java ignores -classpath if you also specify -jar In /home/$USER/Projects/Lab/tika/classes/ I have: sqlite-jdbc-3.

Re: Subfile Extraction

2018-03-27 Thread Nick Burch
On Sun, 25 Mar 2018, McGreevy, Anthony wrote: I am currently playing with Tika to see how it works with regards to extraction of subfiles. Do you mean files or resources embedded within another file? If so... With the Tika App, you want -z to have these extracted. With the Tika java classes,

Re: Hex of RSS xml file is not recognized as RSS file MIME type

2018-04-19 Thread Nick Burch
On Wed, 18 Apr 2018, Jean-Nicolas Boulay Desjardins wrote: I converted this RSS XML content to hex: Then send it to Tika... Tika returns: text/plain Base 64 encoded XML is no longer valid XML, so this is as expected. Why am I not getting the rss mime type? You need to send Tika the real
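
A small illustration of the point in the reply: encoded XML no longer looks like XML, so the bytes need decoding before any content-based detection. This sketch assumes Base64 input. looksLikeXml is a deliberately crude, hypothetical stand-in for a detector and is not how Tika detects XML.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class DecodeBeforeDetect {
    /** Crude stand-in for a detector: does the data start with a '<'? */
    static boolean looksLikeXml(byte[] data) {
        int n = Math.min(data.length, 64);
        String head = new String(data, 0, n, StandardCharsets.UTF_8).trim();
        return head.startsWith("<");
    }

    /** Recovers the real bytes from their Base64 text form. */
    static byte[] decode(String base64) {
        return Base64.getDecoder().decode(base64);
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?><rss version=\"2.0\"><channel/></rss>";
        String encoded = Base64.getEncoder()
                .encodeToString(xml.getBytes(StandardCharsets.UTF_8));

        // The encoded form is plain ASCII text and contains no '<' at all.
        System.out.println(looksLikeXml(encoded.getBytes(StandardCharsets.US_ASCII)));
        // The decoded bytes are the real document again.
        System.out.println(looksLikeXml(decode(encoded)));
    }
}
```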

Re: Tika Parsers jar?

2018-04-19 Thread Nick Burch
On Thu, 19 Apr 2018, AJ Weber wrote: But I can't find that jar anywhere in any of the download areas.  (I don't know why, but my maven isn't working properly.) You need to use Maven / Gradle / Ivy to fetch it, and everything it depends on Can someone point me to the location of such a jar an

Re: Forcing Parser Invocation

2018-04-24 Thread Nick Burch
On Mon, 23 Apr 2018, lewis john mcgibbney wrote: Using the tika-server, I am having issues parsing the attachment ENVI hdr file at [0] with the EnviHeaderParser [1]. Is there any way I can explicitly force execution of the EnviHeaderParser? I think not directly on a per-request basis. All the

Re: Google Takeout GChat messages

2018-09-04 Thread Nick Burch
On Tue, 4 Sep 2018, Tucker Barbour wrote: I've exported a GMail archive in MBOX format using takeout.google.com. The MBOX archive also includes GChat messages. However, the GChat messages do not include a Date header. Instead the date sent is included in what appears to be a non-conforming RFC

Re: Sample Rate / Audio Sample Rate not included in XML output

2018-10-17 Thread Nick Burch
On Wed, 17 Oct 2018, Tim Allison wrote: This is one of the limitations of a streaming write. As I look at the code of the MP3Parser, I _think_ it would be trivial to write the metadata before writing any content, and it wouldn't get in the way of a streaming parse because the parser reads the wh

Re: Anyone have a nice Unix service script for running Tika Server?

2019-10-16 Thread Nick Burch
On Wed, 16 Oct 2019, Eric Pugh wrote: I’m looking at running Tika Server mode in a Linux box (and sorry, I don’t know the specific flavour….). Is there a nice service script to deal with bringing Tika back up if the Linux box is restarted? Are you using a systemd-based linux, or a different one,

Re: Encoding detectors in OSGi (tika-bundle)

2019-11-12 Thread Nick Burch
On Tue, 12 Nov 2019, Katsuya Tomioka wrote: I'm having trouble accessing encoding detectors in OSGi with Tika 1.22. AutoDetectParser returns "Failed to detect the character encoding of a document" for non-Latin text. We are migrating from 1.10, I'm sure many things are different. It seems like

Re: Setting PDF2XHTML img src

2020-01-03 Thread Nick Burch
On Fri, 3 Jan 2020, Mike Dalrymple wrote: I've just started using Tika to process PDFs with embedded images. I'm getting fantastic results but I'm having to post-process the generated XHTML to correct the value of the src attribute on the img elements. That is expected. A simple sax handler sh
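
One way the "simple sax handler" mentioned in the reply could look, as a JDK-only sketch. It assumes the generated XHTML carries img elements whose src starts with "embedded:" (an assumption about the output, as is the sample markup). The class name ImgSrcRewriter and the "images/" base path are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

class ImgSrcRewriter extends XMLFilterImpl {
    private static final String SCHEME = "embedded:"; // assumed src prefix
    private final String base;                        // hypothetical target path

    ImgSrcRewriter(XMLReader parent, String base) {
        super(parent);
        this.base = base;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if ("img".equals(qName)) {
            AttributesImpl copy = new AttributesImpl(atts);
            int i = copy.getIndex("src");
            if (i >= 0 && copy.getValue(i).startsWith(SCHEME)) {
                copy.setValue(i, base + copy.getValue(i).substring(SCHEME.length()));
            }
            atts = copy;
        }
        // Every event, rewritten or not, is passed on to the next handler.
        super.startElement(uri, localName, qName, atts);
    }

    /** Parses the XHTML through the filter and returns the src values seen downstream. */
    static List<String> rewrittenSources(String xhtml, String base) {
        List<String> found = new ArrayList<>();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            ImgSrcRewriter filter = new ImgSrcRewriter(reader, base);
            filter.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String u, String l, String q, Attributes a) {
                    if ("img".equals(q)) {
                        found.add(a.getValue("src"));
                    }
                }
            });
            filter.parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return found;
    }

    public static void main(String[] args) {
        String xhtml = "<div><img src=\"embedded:image0.png\"/><p>text</p></div>";
        System.out.println(rewrittenSources(xhtml, "images/"));
    }
}
```

Because the rewrite happens in a filter, the downstream handler (here a collector, in practice a serializer) never sees the original src values, which avoids post-processing the finished XHTML.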

Re: WARNING: org.xerial's sqlite-jdbc is not loaded for 1.2.4

2020-04-21 Thread Nick Burch
On Mon, 20 Apr 2020, Bradley Beach wrote: I have tried every permutation of adding sqlite-jdbc-3.30.1.jar to my classpath but still get:   java -classpath ".:sqlite-jdbc-3.30.1.jar" -jar tika-server-1.24.jar --host=localhost --port=12345 You can't combine -classpath and -jar, you have to use on

Re: WARNING: org.xerial's sqlite-jdbc is not loaded for 1.2.4

2020-04-22 Thread Nick Burch
On Wed, 22 Apr 2020, Tim Allison wrote: Y. Agreed. Where should we document this? Where would you look for it? The Tika Server and Tika App both get a fair bit of use from non-Java devs Maybe we need a quickstart for non-Java folks section, and probably a python-specific one as we get loads o

Re: Extract URLs from a document

2020-11-12 Thread Nick Burch
On Wed, 11 Nov 2020, nensick wrote: I am exploring the available features and I managed also to extract Office macros but I still don't find a way to get the links. Imagine to have a PDF, a DOCX in which you have a "click here" text as a link pointing to a website (let's say example[.]com). Ho

Re: Mimetypes

2020-12-22 Thread Nick Burch
On Tue, 22 Dec 2020, Peter Kronenberg wrote: I'm trying to detect the mimetype of a file using both Tika.detect(InputStream) and Tika.detect(File) I get 2 different results. I'm testing with a Microsoft Word (.doc) file. The InputStream one is based on just the first few kb of the file. That

RE: Mimetypes

2020-12-23 Thread Nick Burch
On Tue, 22 Dec 2020, Peter Kronenberg wrote: Oh, so reading the stream doesn't read the whole file? Not for Detect, no. The assumption is that Detect is normally followed by Parse, so you won't want the Stream consuming, so we do a mark/reset to check the first few kb only I know for Office
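
The mark/reset pattern described in the reply can be sketched with plain JDK streams. This is an illustration of the idea only, not Tika's detector code. PeekStream and the 4-byte limit are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class PeekStream {
    /** Reads up to limit bytes from the start of the stream, then rewinds it. */
    static byte[] peek(InputStream in, int limit) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        try {
            in.mark(limit); // remember the current position
            byte[] buf = new byte[limit];
            int total = 0;
            int n;
            while (total < limit && (n = in.read(buf, total, limit - total)) != -1) {
                total += n;
            }
            in.reset();     // rewind so a later parse sees the whole stream
            return Arrays.copyOf(buf, total);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(
                "%PDF-1.7 body".getBytes(StandardCharsets.US_ASCII));
        byte[] head = peek(in, 4);
        System.out.println(new String(head, StandardCharsets.US_ASCII));
        // The stream is back at the start, so the next byte is the first one again.
        System.out.println((char) in.read());
    }
}
```

A stream that does not support mark/reset can be wrapped in a BufferedInputStream first. The peeked prefix is all a detector working this way gets to see, which is why formats that need more than the first few kb can be detected differently from a stream and from a file.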

RE: Mimetypes

2020-12-23 Thread Nick Burch
On Wed, 23 Dec 2020, Peter Kronenberg wrote: But yet, if I understand correctly, using a TikaInputStream *will* spool the entire stream to disk so it can read everything, right? If I re-read the stream to parse, is it making 2 passes? TikaInputStream has logic in it to dump the stream to a temp

RE: Mimetypes

2020-12-23 Thread Nick Burch
On Wed, 23 Dec 2020, Peter Kronenberg wrote: Best is to wrap as a TikaInputStream, detect using all the detectors via DefaultDetector, then parse after that. But sometimes the detect will read the whole file, right? For example, for Word. So is it then making 2 passes? Nope, we stash the

Re: Metadata

2020-12-29 Thread Nick Burch
On Mon, 28 Dec 2020, Peter Kronenberg wrote: For the metadata that comes back from a parse (example below), clearly, the fields are dependent on the file type and information available. Are there any 'standard' fields that come back for all/any files? Such as Author, date, x-parsed-by, etc. I

Re: OCR on PDFs

2020-12-31 Thread Nick Burch
On Thu, 31 Dec 2020, Peter Kronenberg wrote: I've got Tika working with Tesseract on PDF files, but it seems that if I give it a PDF file that has both searchable text and images, the text is OCRed twice. Is this a PDF where some other tool has already done the OCR and stored the text it foun

Re: Error calling ImageMagick

2021-02-12 Thread Nick Burch
On Thu, 11 Feb 2021, Tim Allison wrote: I can replicate this on my windows laptop. The weird thing is that the image file is actually there and if I pause the debugger at the point after imagemagick has complained that the file isn't there but before Tika does the clean up, Windows is funny ab

RE: Re-using a TikaStream

2021-02-23 Thread Nick Burch
On Tue, 23 Feb 2021, Peter Kronenberg wrote: I was re-reading some emails with Nick Burch back around Dec 22-23 and maybe I misunderstood him, but it sounds like he was saying that TikaInputStream was smart enough to automatically spool the stream to disk to allow re-use. If a parser knows

RE: Re-using a TikaStream

2021-03-01 Thread Nick Burch
On Fri, 26 Feb 2021, Peter Kronenberg wrote: For most audio files, using the AudioParser, the buffer is still at the beginning. Even though there is no text extraction, I would think that Tika still needs to read through the stream. The MP3Parser consumes the stream, but the MP4Parser does not

RE: Re-using a TikaStream

2021-03-01 Thread Nick Burch
On Mon, 1 Mar 2021, Peter Kronenberg wrote: But the issue is that different parsers return the stream in different states. Sometimes the stream is all used up (although not closed). And other times, the stream has been re-set to the beginning where it can be re-used. Is this expected behavior

Re: Re-using a TikaStream

2021-03-01 Thread Nick Burch
On Mon, 1 Mar 2021, Tim Allison wrote: detectors should return the stream reset to the beginning. I agree - needs to be ready for the parser to then process Parsers, IIRC, should return the stream fully(?) read but not closed. Not always - if the parser wanted a File then it may not have to

Re: Microsoft alternate fonts on RHEL

2021-03-06 Thread Nick Burch
On Sat, 6 Mar 2021, Subhajit Das wrote: But, the fonts and packages are not available on RHEL, as those are Debian packages. Please suggest alternate option to setup all supported fonts and packages on RHEL. Without a RHEL support login I can't be sure if these help or not, but I'd suggest

Re: TikaServer Header Name is Case-sensitive

2021-03-15 Thread Nick Burch
On Mon, 15 Mar 2021, Subhajit Das wrote: It seems that TikaServer 1.25 header like “X-Tika-PDFOcrStrategy” is case sensitive. Yes. That's because those then get mapped onto underlying Java classes and methods, which are case sensitive According to :https://stackoverflow.com/questions/525897

RE: Parsing PDF file - setting threshold of unmapped characters

2021-04-14 Thread Nick Burch
On Wed, 14 Apr 2021, Peter Kronenberg wrote: Anyone have any thoughts on this? I think both an absolute and a percentage would be good, but I don't have enough experience to comment on your suggested numbers for those two thresholds, sorry! Your idea on best vs fast touches on much older di

Re: UNSUBSCRIBE

2021-04-16 Thread Nick Burch
On Fri, 16 Apr 2021, Maloney, Patrick (ITS) wrote: UNSUBSCRIBE To unsubscribe from the Apache Tika users list, send an email to user-unsubscr...@tika.apache.org and then reply to confirm. This info is also included in every email Nick

RE: UNSUBSCRIBE

2021-04-16 Thread Nick Burch
On Fri, 16 Apr 2021, Maloney, Patrick (ITS) wrote: Thanks, but that info is not in the individual e-mails...I checked for that. Hmm, that might be an issue with your email client. Every list message has this in the headers Mailing-List: contact user-h...@tika.apache.org; run by ezmlm

Re: Tika Docker licence

2021-04-16 Thread Nick Burch
On Tue, 13 Apr 2021, Subhajit Das wrote: The Tika Docker image (full) uses ‘ttf-mscorefonts-installer’. The licence used by it is Microsoft licence and doesn’t seem to allow commercial use. Can anyone please confirm if it is ok to use? Or should a customized version to be used for production?

Re: Tika Docker licence

2021-04-17 Thread Nick Burch
On Sat, 17 Apr 2021, Lewis John McGibbney wrote: Please point me to the code for the ‘ttf-mscorefonts-installer’. The bit of the Tika docker file that pulls them in is: https://github.com/apache/tika-docker/blob/master/full/Dockerfile#L21 I think the EULA (which we auto-accept during installat

Re: best practices for avoiding OOM for tika docker

2021-05-28 Thread Nick Burch
On Thu, 27 May 2021, Cristian Zamfir wrote: I am running some stress tests of the latest tika server docker (not modified in any way, just pulled from the registry) and seeing that after a few hours I see OOM in the logs. The container has a limit of 4GB set in K8S. I am wondering if you have any

Re: best practices for avoiding OOM for tika docker

2021-06-02 Thread Nick Burch
On Wed, 2 Jun 2021, Cristian Zamfir wrote: 1. Do you have a recommendation for a stress test that would allow me to easily test OOM behavior? Depends what kind of OOM you're interested in. If you fire a lot of memory-hungry documents at a single server at once, you can trigger an OOM. Alterna

Re: --header "X-Tika-OCR: false" ; an option to fully disable OCR for each request

2021-06-10 Thread Nick Burch
On Thu, 10 Jun 2021, Cristian Zamfir wrote: It would be nice if this was feasible via the headers of each request. I find it more convenient to use if/else in my code than in the yaml files used for k8s configuration. Is there such an option? Three options, see https://cwiki.apache.org/conflu

Re: --header "X-Tika-OCR: false" ; an option to fully disable OCR for each request

2021-06-10 Thread Nick Burch
On Thu, 10 Jun 2021, Cristian Zamfir wrote: Thanks Nick. Looks like the option I was looking for is the 3rd one, but the docs say it is only available in Tika 2.x - am I right? I've just done a grep of the codebase, and it isn't in the 1.x branch, only main = 2.x. So, Tika 2.x only Nick

Re: --header "X-Tika-OCR: false" ; an option to fully disable OCR for each request

2021-06-10 Thread Nick Burch
On Thu, 10 Jun 2021, Cristian Zamfir wrote: Got it, thanks. What are your thoughts on using Tika 2.x while still in beta? Is it likely to be more stable than 1.26? I presume it has passed the same extensive test suite. Usage stability wise, it's as good as 1.x. API stability wise things are s

Re: logging formatter configuration compatible with StackDriver

2021-06-11 Thread Nick Burch
On Fri, 11 Jun 2021, Cristian Zamfir wrote: I think for most people it would be quite critical to have logs working. Do you happen to know how I can reach out to the person maintaining the docker images https://hub.docker.com/u/dameikle to see if they are available to update the images? Sounds li

Re: dcterms:created date changes on RTF documents

2021-07-22 Thread Nick Burch
On Thu, 22 Jul 2021, David Pilato wrote: TL;DR: the created date of the document changes depending on the timezone. That does seem a bug For example: • Asia/Sakhalin gives dcterms:created=2016-07-06T23:38:00Z • Asia/Colombo gives dcterms:created=2016-07-07T05:08:00Z • Europe/Stockholm gives
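
A JDK-only sketch of how a zone-less timestamp yields different UTC values depending on the default time zone. It assumes the document stores the local time 2016-07-07 10:38 with no zone; that value is an inference from the two complete values quoted above and is not stated in the thread. ZoneDependentDate is a hypothetical name.

```java
import java.time.LocalDateTime;
import java.time.ZoneId;

class ZoneDependentDate {
    /** Interprets a zone-less local timestamp in the given zone and renders it in UTC. */
    static String toUtc(LocalDateTime local, String zone) {
        return local.atZone(ZoneId.of(zone)).toInstant().toString();
    }

    public static void main(String[] args) {
        // Assumed stored value: a local time with no zone information.
        LocalDateTime stored = LocalDateTime.of(2016, 7, 7, 10, 38);
        for (String zone : new String[] {"Asia/Sakhalin", "Asia/Colombo", "Europe/Stockholm"}) {
            System.out.println(zone + " gives " + toUtc(stored, zone));
        }
    }
}
```

With an up-to-date time zone database this should reproduce the Asia/Sakhalin and Asia/Colombo values in the report. The same stored local time maps to a different instant in each zone, which is the behaviour being described.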

Re: Deleted text in Word document

2021-08-27 Thread Nick Burch
On Fri, 27 Aug 2021, Peter Kronenberg wrote: When Tika extracts from a Microsoft Word document, deleted text is extracted, with no indication that it is deleted. In fact, if a word was deleted and replaced by another word, both words just show up side-by-side. Is there a way to get some sort

Re: Tika 2.1.0 pdf parser

2021-10-21 Thread Nick Burch
On Thu, 21 Oct 2021, nskarthik wrote: Question : Need to extract Text / images at page level using java. Did not find any example on www or Tika website. For PDF, you should fetch the contents as XHTML rather than plain text. You can then split on the page divs. This isn't available for forma
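
A sketch of splitting XHTML on page divs, as suggested in the reply, using only the JDK's SAX parser. It assumes each page is wrapped in a div with class="page"; the sample markup and the class name PageSplitter are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class PageSplitter extends DefaultHandler {
    private final List<String> pages = new ArrayList<>();
    private StringBuilder current; // null while outside a page div
    private int depth;             // div nesting depth inside the current page

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (!"div".equals(qName)) {
            return;
        }
        if (current == null && "page".equals(atts.getValue("class"))) {
            current = new StringBuilder(); // a new page starts here
            depth = 1;
        } else if (current != null) {
            depth++;                       // nested div inside a page
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("div".equals(qName) && current != null && --depth == 0) {
            pages.add(current.toString().trim());
            current = null;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    /** Returns the text of each page div, in document order. */
    static List<String> split(String xhtml) {
        PageSplitter handler = new PageSplitter();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.pages;
    }

    public static void main(String[] args) {
        String xhtml = "<html><body>"
                + "<div class=\"page\"><p>one</p></div>"
                + "<div class=\"page\"><p>two</p></div>"
                + "</body></html>";
        System.out.println(split(xhtml));
    }
}
```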

Re: Returning file extension alongside mime-type?

2022-02-10 Thread Nick Burch
On Thu, 10 Feb 2022, Willy T. Koch wrote: As for content detection, today the content-type field with mime type is returned. What we would need is a mime-type to file extension lookup and it seems logical that this was also returned by Tika. How are you calling Tika? We already have APIs for t

Re: Returning file extension alongside mime-type?

2022-02-10 Thread Nick Burch
On Thu, 10 Feb 2022, Willy T. Koch wrote: …and calling it as a webservice with Postman/curl. Ah, I think we might not be exposing the full details of the mime types via the server, only details of their parsers and the hierarchy, eg http://localhost:9998/mime-types#audio/vorbis (We have that

Re: Returning file extension alongside mime-type?

2022-02-17 Thread Nick Burch
On Thu, 10 Feb 2022, Nick Burch wrote: On Thu, 10 Feb 2022, Willy T. Koch wrote: …and calling it as a webservice with Postman/curl. Ah, I think we might not be exposing the full details of the mime types via the server, only details of their parsers and the hierarchy, eg http://localhost

Re: Returning file extension alongside mime-type?

2022-02-24 Thread Nick Burch
On Tue, 22 Feb 2022, Tim Allison wrote: I guess the question is how far do we want to bake this in? I could see adding a field for the default extension in the CompositeDetector/DefaultDetector. This would then be triggered on embedded files, too. I can't imagine this would add much cost co

Re: Returning file extension alongside mime-type?

2022-02-24 Thread Nick Burch
On Thu, 24 Feb 2022, Tim Allison wrote: A separate endpoint, then? That would be cleaner. We already have some mime details related endpoints, would be an extension or related endpoint to those, see earlier-thread: https://lists.apache.org/thread/jlym8ypnrj978hmzjgvkc1fpxnc7g51h Nick

Re: Returning file extension alongside mime-type?

2022-03-07 Thread Nick Burch
On Fri, 18 Feb 2022, Willy T. Koch wrote: Den Tor 17 feb 2022, kl. 20:00, skrev Nick Burch: Tika devs - any thoughts on this? It's a pretty small code change (we already have the data on the mime type!), just need feedback on extending the existing API vs adding a new one By also retu

Re: Returning file extension alongside mime-type?

2022-03-11 Thread Nick Burch
On Tue, 8 Mar 2022, Willy T. Koch wrote: That’s fantastic, thank you! Looking forward to testing when the Tika Docker repo is updated with this release. That may take a few weeks, but if you don't mind building Tika from source, you should be able to give it a whirl now. (As far as I'm aware

Re: ForkParser issues with 2.3.0

2022-04-26 Thread Nick Burch
On Tue, 26 Apr 2022, Stephen H wrote: Second, there seems to be some work missing in the handling of metadata from certain parsers when using ForkParser. For example, for OpenDocument ODP and ODS files and Microsoft Open XML formats, while the document text is returned there is no metadata in e

Re: ForkParser issues with 2.3.0

2022-04-26 Thread Nick Burch
On Tue, 26 Apr 2022, Stephen H wrote: On 26/04/2022 12:22, Nick Burch wrote: Are you able to write a short junit unit test case which shows this issue? We have a bunch of small test OOXML and ODF files that could be used I've done this - if I create an issue in Jira with it would that

Re: Custom filter

2022-06-03 Thread Nick Burch
On Fri, 3 Jun 2022, Cihad Guzel wrote: I want to pass the content's words through some filters while parsing in Tika. How can I add custom filtering? Does the content handler work for this? Is there a document about this? A custom content handler is a pretty good way to do that. Tika just use
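
A rough sketch of a custom content handler that filters words, assuming the handler is a plain SAX ContentHandler as the reply suggests. WordFilterHandler and the blocked-word list are hypothetical. For brevity only character events are forwarded; a real handler would pass element events through as well.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WordFilterHandler extends DefaultHandler {
    private final ContentHandler downstream;
    private final Set<String> blocked;

    WordFilterHandler(ContentHandler downstream, Set<String> blocked) {
        this.downstream = downstream;
        this.blocked = blocked;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // Note: SAX may deliver one text run in several calls, so a word
        // split across two calls would slip past this simple filter.
        StringBuilder kept = new StringBuilder();
        for (String word : new String(ch, start, length).split("\\s+")) {
            if (!word.isEmpty() && !blocked.contains(word.toLowerCase(Locale.ROOT))) {
                if (kept.length() > 0) {
                    kept.append(' ');
                }
                kept.append(word);
            }
        }
        char[] out = kept.toString().toCharArray();
        downstream.characters(out, 0, out.length);
    }

    /** Convenience helper: pushes text through the filter and collects the result. */
    static String filter(String text, String... blockedWords) {
        StringBuilder collected = new StringBuilder();
        ContentHandler sink = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                collected.append(ch, start, length);
            }
        };
        WordFilterHandler handler =
                new WordFilterHandler(sink, new HashSet<>(Arrays.asList(blockedWords)));
        try {
            handler.characters(text.toCharArray(), 0, text.length());
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return collected.toString();
    }

    public static void main(String[] args) {
        System.out.println(filter("The quick brown fox", "the"));
    }
}
```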

Re: Datasets for testing large number of attachments

2022-07-26 Thread Nick Burch
On Mon, 25 Jul 2022, Oscar Rieken Jr via user wrote: I am currently trying to validate our Tika setup and was looking for a set of example data I could use If you want a small number of files of lots of different types, the test files in the Tika source tree will work. Main set are in tika-pa

Re: Tika documentation?

2022-09-01 Thread Nick Burch
On Thu, 1 Sep 2022, Mark Kerzner SHMsoft, Inc. wrote: Yes, please. If I make some changes, I will start with small ones. I will also verify them with you. Great, thanks in advance for your contributions! Can you please head to https://cwiki.apache.org/confluence/display/tika/ , click Sign Up

Re: Validate MIME-type

2022-09-29 Thread Nick Burch
On Thu, 29 Sep 2022, Peter Conrad wrote: thanks. That's definitely an improvement. But I think it's not sufficient. AFAICS your code uses "aliases" as in "if it's type X then it can also be type Y". However there's also cases where a specific instance of type X can also be type Y but not all ins

Re: Custom Parser Plugin for Tika Server

2022-10-26 Thread Nick Burch
On Wed, 26 Oct 2022, Tim Allison wrote: I've been struggling with this too. Outside of Docker, what I've been doing is using a bin/ directory and throwing everything in there and then starting tika-server: java -cp "bin/*" org.apache.tika.server.core.cli.TikaServerCli ... If we moved to that mo

Re: Paragraph words getting merged

2022-10-31 Thread Nick Burch
On Sun, 30 Oct 2022, Christian Ribeaud wrote: I am using the default configuration. I think, we could reduce my problem to following code snippet: Is there a reason that you aren't using one of the built-in Tika content handlers? Generally they should be taking care of everything for you with

Re: Subset(s) of Tika?

2023-01-05 Thread Nick Burch
On Thu, 5 Jan 2023, Georg.Fischer wrote: The tika.jar has >54 MB, and I suspect that the loading of the big jar (under Windows) is hindering the performance. I should perhaps move to Linux, or try the Tika server. The Tika App jar has always been the "kitchen sink included quickstart" option

Re: Best practice for extracting content and metadata repeatedly

2023-03-06 Thread Nick Burch
On Mon, 6 Mar 2023, Chris Bamford via user wrote: From both performance and thread safety points of view what is the best approach for the use / reuse of the following objects: Tika ParseContext Parser Metadata The Tika object and/or TikaConfig object should only be created once and then re-

Re: Tika incorrectly detecting Canon raw image file .cr3 as video/quicktime

2023-03-22 Thread Nick Burch
On Wed, 22 Mar 2023, Tim Allison wrote: Thank you, Richard, for raising this. In looking at these file formats, it looks like crw is based on ciff, cr2 is based on tiff and cr3 is based on quicktime. Always fun when the core of a format (or at least the container) swaps between versions! F

Re: Run Tika-docker with custom config

2023-04-28 Thread Nick Burch
On Fri, 28 Apr 2023, שי ברק wrote: Inside the container probably - makes more sense to me In that case, create a custom Docker container that adds in your custom config to your Docker image, as per Konstantin's instructions: https://lists.apache.org/thread/l0od2b6tp6odyd661ftjqmkkf27o6hdl Th

Re: Run Tika-docker with custom config

2023-04-28 Thread Nick Burch
On Fri, 28 Apr 2023, שי ברק wrote: I don’t know if it’s possible but I’m trying to avoid typing this ‘--config’ when I start the container. I wish to have all of these settings to be written inside the Dockerfile. Since you're doing your own custom docker container, you could override the E

Re: TIKA for MIME type detection

2023-07-27 Thread Nick Burch
On Tue, 20 Jun 2023, Neha Kamat via user wrote: I am currently working on an application wherein I would like to whitelist the filetypes supported by TIKA And discard rest of the files to avoid unknown behaviour/memory leaks. I am currently referring to https://cwiki.apache.org/confluence/displ

Re: Using Tika with another OCR engine

2023-08-08 Thread Nick Burch
On Thu, 3 Aug 2023, Cristian Zamfir wrote: I am interested in trying out Tika with a different OCR engine and wondering how Tesseract is integrated. Largely as "just another parser", but IIRC with a bit of logic to allow the "normal" image parsers to also have a go at the file to grab metadata

Re: PST file parsing

2023-11-29 Thread Nick Burch
On Wed, 29 Nov 2023, Neha Kamat via user wrote: We are currently using TIKA for parsing/extracting content from pst files. Is there a way we can tell parsing engine to parse as list of emails instead of string of emails? Depends how you're calling Tika? Tika App? Tika Server? Python Wrapper? J

Re: Unexpected behavior when inspecting mp4 files with different ISO

2024-04-27 Thread Nick Burch
On Fri, 26 Apr 2024, Mauler, David wrote: I'm in the process of troubleshooting an issue with certain mp4 video files and tika. After a bunch of digging, it appears to be related to whatever ISO is set for the mp4 file. An mp4 with an ISO of 14496-12:2003 will be detected as video/quicktime but

Re: Build Failure

2010-06-23 Thread Nick Burch
On Wed, 23 Jun 2010, McGibbney, Lewis John wrote: and have received the following output when running tests on Apache Tika Parsers Try looking in tika-parsers/target/surefire-reports/ There should be text and xml files for each test, which will describe what's gone wrong. Hopefully they'll p

Re: error parsing visio files

2010-06-24 Thread Nick Burch
On Wed, 23 Jun 2010, Mango wrote: Caused by: java.lang.ArrayIndexOutOfBoundsException: 42 at javax.swing.text.rtf.RTFReader$AttributeTrackingDestination.handleKeyword(Unknown Source) This looks like a fault in the core java rtf parser :/ Do you know how your rtf file was created? I

Re: error parsing visio files

2010-06-24 Thread Nick Burch
On Thu, 24 Jun 2010, Mango wrote: From what I can tell is that the rest of .rtf files are parsed without a problem. Only those with embedded Visio diagrams create problems. Weird thing is I just tried parsing a .doc document with embedded .vsd and it was parsed without a problem. One test wo

Re: Limiting the extracted content

2010-06-28 Thread Nick Burch
On Mon, 28 Jun 2010, Jana, Kumar Raja wrote: We use Apache Tika in our application before sending the content to Solr for Indexing. Some of our documents are pretty large (over 150 MB in size with "only text" content over 30 MB). What file formats are these in? There are some file formats (eg

Re: isDescendantOf

2010-08-03 Thread Nick Burch
On Tue, 3 Aug 2010, Chris Bamford wrote: BTW I now get a runtime error with POI when trying to extract text from a Corel presentation (.shw) file - which seems to get classified as an MS Office doc?! We don't currently seem to have any .shw files in the test suite, so it's possible that thing

Re: isDescendantOf

2010-08-03 Thread Nick Burch
On Tue, 3 Aug 2010, Chris Bamford wrote: ... Any chance you could create some sample files, and upload them to jira? We can use them as a basis for future unit tests Yes sure - please point me at the URL https://issues.apache.org/jira/browse/TIKA Exception in thread "main" java.lang.NoSuch

Re: How can I configure Tika to extract dates in single format?

2010-09-08 Thread Nick Burch
On Wed, 8 Sep 2010, Sergiy Karpenko wrote: When I test content and metadata extraction by Tika, I met next usecases: - Date in metadata (DublinCore.DATE, MSOffice.LAST_SAVED, MSOffice.CREATION_DATE) Date returned as String, but format is different for different document types. Probably you alread

Re: Great 2-part blog article on Apache Tika

2010-09-25 Thread Nick Burch
On Fri, 24 Sep 2010, Mattmann, Chris A (388J) wrote: by our very own Nick Burch! :) See here: http://s.apache.org/JMu Glad you like it :) I'll hopefully do another few posts about Tika in the next week or so, but they'll be more about fine grained control of how Tika in Alfr

Re: Supported Metadata Tags

2010-09-29 Thread Nick Burch
On Wed, 29 Sep 2010, Grant Ingersoll wrote: IPTC (image) There is support for this in Staffan Olsson's git fork (see TIKA-482). I'm hoping Staffan will submit an updated patch of this shortly which we'll then be able to apply :) XMP (image/video) -- yes, AFAICT We can generate XMP metad

Re: Compressed RTF / TNEF / LZFU

2010-09-30 Thread Nick Burch
On Thu, 30 Sep 2010, Jan Høydahl / Cominvent wrote: Here's an open source Java implementation of a decompressor: http://www.freeutils.net/source/jtnef/rtfcompressed.jsp Alas that's under the GPL, so can't be used in an official distribution of Tika. (You can use it yourself if you want though,

Re: Compressed RTF / TNEF / LZFU

2010-09-30 Thread Nick Burch
On Thu, 30 Sep 2010, Jan Høydahl / Cominvent wrote: We could implement the decoder without distributing tnef.jar, using Class.forName() and simply disabling the decoder if the jar is not on classpath? Then it is up to the user to download the jar and thereby accept the GPL license. I've got a
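
The optional-dependency idea in the quoted proposal (use Class.forName() and disable the decoder when the jar is absent) can be sketched as below. It illustrates the mechanism only and says nothing about the licensing question. OptionalDecoder and the class name being probed are hypothetical.

```java
class OptionalDecoder {
    /** Returns true if the named class can be loaded from the current classpath. */
    static boolean isAvailable(String className) {
        try {
            // initialize=false: only check that the class is loadable,
            // without running its static initializers.
            Class.forName(className, false, OptionalDecoder.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical class name standing in for the optional decoder jar.
        String decoderClass = "com.example.hypothetical.RtfDecompressor";
        if (isAvailable(decoderClass)) {
            System.out.println("Decoder found, compressed RTF support enabled");
        } else {
            System.out.println("Decoder not on classpath, compressed RTF support disabled");
        }
    }
}
```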

Re: Plugging in your own parser to override an existing

2010-10-08 Thread Nick Burch
On Fri, 8 Oct 2010, Jan Høydahl / Cominvent wrote: My question was for a very specific usecase which is easy to do by a small source code modification but perhaps harder to do with configuration only. Looking at the AutoDetectParser source code, the last parser registered for a given mime typ

Re: Plugging in your own parser to override an existing

2010-10-08 Thread Nick Burch
On Fri, 8 Oct 2010, Jan Høydahl / Cominvent wrote: Magic is most often great, but I generally prefer to have some way of explicitly telling the software what to do :) That's very much available to you! See the different constructors to the AutoDetectParser for examples of how to control what d

Re: question and possible error about output xhtml

2010-10-22 Thread Nick Burch
On Thu, 21 Oct 2010, qubit wrote: When translating a text file -- file.txt -- through tika and looking at the raw output, tika is essentially inserting no markup for line breaks or paragraphs. Most of the logic in TXTParser is around languages and types, there's not much on the markup Also,

Re: question and possible error about output xhtml

2010-10-22 Thread Nick Burch
On Fri, 22 Oct 2010, qubit wrote: Thank you for your reply -- I will look into making the patch; it will get me immersed in the code so I understand it better. The code you probably want to look at is TXTParser in the tika-parser package. The parser quickstart guide at http://tika.apache.org/0

Re: error parsing .XLS file

2010-11-05 Thread Nick Burch
On Fri, 5 Nov 2010, Roland Cornelissen wrote: Caused by: java.io.IOException: Unable to read entire block; 1 byte read; expected 512 bytes at org.apache.poi.poifs.storage.RawDataBlock.<init>(RawDataBlock.java:62) This is normally caused by truncated files. However, it might be worth trying with a

Re: Consistent metadata

2010-11-06 Thread Nick Burch
On Sat, 6 Nov 2010, Shay Banon wrote: Just wanted to check in and see if this has progressed since I last asked? There has been some, I'd suggest you try a recent SVN checkout, then open a JIRA if you spot any more cases where two parsers give different responses for the same effective input

Re: problems parsing an xls spreadsheet

2010-12-19 Thread Nick Burch
On Fri, 17 Dec 2010, Shaun Cutts wrote: Caused by: java.lang.NullPointerException at com.sun.org.apache.xml.internal.serializer.ToStream.writeAttrString(ToStream.java:1962) at com.sun.org.apache.xml.internal.serializer.ToStream.processAttributes(ToStream.java:1942) This doesn'

Re: problems parsing an xls spreadsheet

2010-12-20 Thread Nick Burch
On Mon, 20 Dec 2010, Shaun Cutts wrote: As you are being used for scraping purposes, however, you should probably be able to read anything excel can write, including inconsistent unicode. (If it is inconsistent -- I note that I don't receive a "processingInstruction" callback to write the docum

Re: Bouncy Castle - can it be left out?

2010-12-20 Thread Nick Burch
On Mon, 20 Dec 2010, jason.holmb...@emc.com wrote: Just starting to use Tika 0.8 in conjunction with DokuWiki, and I noticed the dependency on Bouncy Castle through PDFBox. Is it possible to remove this dependency, given that we're not using Tika for any encryption purposes? You may be surpri

Re: problems parsing an xls spreadsheet

2010-12-21 Thread Nick Burch
On Tue, 21 Dec 2010, Shaun Cutts wrote: ok, but when I call parse, then my ContentHandler.characters() callback gets a char [], and this is passed as: (Pdb) ch array('c', '\xa9 2010 Crane Data LLC. All rights reserved.') so when I try unicode I get an error: (Pdb) ch.tounicode() *** ValueE

Re: problems parsing an xls spreadsheet

2010-12-21 Thread Nick Burch
On Wed, 22 Dec 2010, Shaun Cutts wrote: Ok -- I have rewritten my simple output formatter in java (see below). The unicode problems were perhaps java/python problems as you suspected. The real problem is that tika passes an "Attributes" which has a null value returned by getValue(). Are you a
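For an output formatter that has to cope with an Attributes object whose getValue() returns null, one defensive option is to substitute an empty string before writing. The following is a minimal sketch under that assumption; the class `NullSafeAttributes` and its methods are hypothetical helpers, it uses only the standard org.xml.sax interfaces, and it works around the symptom rather than explaining why the null appears.

```java
import org.xml.sax.Attributes;

/**
 * Sketch: read SAX attribute values defensively so a null value
 * cannot trigger a NullPointerException in the caller.
 * Hypothetical helper, not part of Tika.
 */
final class NullSafeAttributes {
    private NullSafeAttributes() {}

    /** Returns the attribute value at index, or "" when it is null. */
    static String valueOrEmpty(Attributes atts, int index) {
        String value = atts.getValue(index);
        return value == null ? "" : value;
    }

    /** Renders all attributes as name="value" pairs (no XML escaping, for brevity). */
    static String render(Attributes atts) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < atts.getLength(); i++) {
            sb.append(' ')
              .append(atts.getQName(i))
              .append("=\"")
              .append(valueOrEmpty(atts, i))
              .append('"');
        }
        return sb.toString();
    }
}
```

A ContentHandler's startElement() could call `render(atts)` instead of reading the values directly; a real serializer would also need to escape attribute values.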

Re: [Excel] Reference to another cell

2011-02-09 Thread Nick Burch
On Wed, 9 Feb 2011, Dmitrii Dimandt wrote: When I convert this file to a text format, I get this: 01/10/10 40,452 the 10th of January 2010 is about 40,450 days from 1st of January 1900, which is how all excel dates actually get stored internally If you take a date cell and reformat it

Re: [Excel] Reference to another cell

2011-02-09 Thread Nick Burch
On Wed, 9 Feb 2011, Dmitrii Dimandt wrote: So I guess that the problem is probably inherent in Excel itself and I pity Apple's developers for getting this right. Because, on top of all things, 01/10/10 (which is October 10th, 2010 over here in Europe) is 40450 days from January 1st, 1900. The
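The serial-number arithmetic discussed in this thread can be checked with a few lines of java.time. This is a minimal sketch, assuming Excel's default 1900 date system and dates on or after 1 March 1900; the class `ExcelSerialDates` is a hypothetical helper, not a POI or Tika API. Under those assumptions the value 40,452 quoted earlier in the thread maps to 1 October 2010, which is 01/10/10 read day-first, while 10 January 2010 has serial 40188.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

/**
 * Sketch: convert between Excel serial day numbers (1900 date system)
 * and calendar dates. Hypothetical helper for illustration only.
 */
final class ExcelSerialDates {
    private ExcelSerialDates() {}

    // Serial 1 is 1900-01-01, but Excel also counts a non-existent 29 Feb 1900.
    // Using 1899-12-30 as day zero absorbs that extra day, so the mapping
    // is correct for serials of 61 (1900-03-01) and above.
    private static final LocalDate EPOCH = LocalDate.of(1899, 12, 30);

    /** Converts a whole-day serial number to a calendar date. */
    static LocalDate toDate(long serial) {
        return EPOCH.plusDays(serial);
    }

    /** Converts a calendar date to its whole-day serial number. */
    static long toSerial(LocalDate date) {
        return ChronoUnit.DAYS.between(EPOCH, date);
    }

    public static void main(String[] args) {
        System.out.println(toDate(40452));                      // 2010-10-01
        System.out.println(toSerial(LocalDate.of(2010, 1, 10))); // 40188
    }
}
```

The fractional part of a serial (time of day) is ignored here; only whole days are handled.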

Re: what are the list of supported file type in tika 0.9 ?

2011-03-28 Thread Nick Burch
On Mon, 28 Mar 2011, Roberto Martelloni wrote: I'm trying to find the list of all supported file types in tika 0.9, can anyone suggest where to find it? Easiest way is to ask tika-app: java -jar tika-app-0.9.jar --list-parser-details That'll give you back the list of all the avai

Re: XMP Metadata extraction

2011-03-28 Thread Nick Burch
On Mon, 28 Mar 2011, Withanage, Dulip wrote: We are interested in extracting the raw metadata (not formatted in XHTML as SAX events) from the files using tika. Generally speaking, all of the metadata that is extracted is placed into the Metadata object you supply when parsing. The SAX events a

RE: XMP Metadata extraction

2011-03-28 Thread Nick Burch
On Mon, 28 Mar 2011, Withanage, Dulip wrote: thank you for your prompt help, that looks promising. Here is my use case. 1. generate the tika app using the source. 2. integrate it into a third-party application 3. use the tika extraction functionalities for images. 4. I have attached the image, it'
