Re: Problem with detection of .mbox file

2016-07-25 Thread Nick Burch
On Mon, 25 Jul 2016, Vjeran Marcinko wrote: I first noticed that my .mbox file doesn't get parsed by MBoxParser, and later, after debugging Tika source code, I found what the problem is - default detector doesn't even recognize it as "application/mbox" MIME type, and although file extension is .mb

Re: DATE metadata from email

2016-05-15 Thread Nick Burch
On Sun, 15 May 2016, Philipp Steinkrüger wrote: To begin with, I noticed the following behaviour which might or might not be a bug. I asked this question on stackexchange (https://stackoverflow.com/questions/37226842/tika-metadata-from-email-misses-date

My "What's new with Apache Tika 2.0" talk slides

2016-05-11 Thread Nick Burch
Hi All For those who couldn't make it to Vancouver this week, the slides from my "What's new with Apache Tika 2.0" talk are now available online: http://www.slideshare.net/NickBurch2/apache-tika-whats-new-with-20 The audio was recorded, hopefully that will be available to go with the slides i

Re: XML Parser with type recognition

2016-05-11 Thread Nick Burch
On Wed, 11 May 2016, plug...@free.fr wrote: If you can take a look at my little gist example https://gist.github.com/anonymous/3506db4367040ea8f381c5b7b435b3f9 it will be very helpful. The localName parameter is case sensitive. Your sample file starts with Nick

Re: XML Parser with type recognition

2016-05-11 Thread Nick Burch
On Wed, 11 May 2016, plug...@free.fr wrote: Ok if I understand I can create a specific mime type into tika-mimetypes.xml resource file like this: http://www.w3.org/2001/XMLSchema-instance"/> Almost - you can't set that glob as it's already claimed. Otherwise, assuming that is the righ

Re: XML Parser with type recognition

2016-05-10 Thread Nick Burch
On Tue, 10 May 2016, plug...@free.fr wrote: But now I'm facing of detecting some XML files but only some specifics, I can't detect only "application/xml", I need to detect which type of XML is it (in my case http://www.iab.com/guidelines/digital-video-ad-serving-template-vast-3-0/). But the fi

Re: disable extraction of images

2016-04-13 Thread Nick Burch
On Wed, 13 Apr 2016, ron.vandenbranden wrote: Is it possible to disable text extraction from images inside a PDF file? I'm testing with the CLI tika app, which has "extractInlineImages" set to false by default, if I'm not mistaken. Yet, the text of the images still is present in the generated H

Re: Fwd: How to enable multiple parsers for content type ?

2016-03-23 Thread Nick Burch
On Wed, 23 Mar 2016, Thamme Gowda N. wrote: Question : How to enable multiple parsers for specific mimetypes? I am using tika to parse html pages. My requirement is that both *NamedEntityParser* and *HtmlParser* has to be enabled for specific web related MIME types like *text/html, * *applicati

Re: Jackson & Fat tika-server jar question

2016-02-23 Thread Nick Burch
On Tue, 23 Feb 2016, John Patrick wrote: I'm working with an existing code base that is using Jackson 2.6.3. I'm now adding Tika, but because the tika-server jar contains Jackson 2.4.0 I'm having lots of compile issues. 1) Was it intentional to have a bloated/fat tika-server jar containing all dependen

Re: Using tika-app-1.11.jar

2016-02-11 Thread Nick Burch
On Wed, 10 Feb 2016, Steven White wrote: I'm including tika-app-1.11.jar with my application and see that Tika includes "slf4j". The Tika App single jar is intended for standalone use. It's not generally recommended to be included as part of a wider application, as it tends to include everyth

Re: Detecting if a file type is supported or not

2016-02-05 Thread Nick Burch
On Fri, 5 Feb 2016, Steven White wrote: For the missing JAR part Set your Load Error Handler to Warn or Error to find out about parsers with missing classes or dependencies This won't do. What's happening now is if I give Tika a JAR file to parse, it is throwing NoClassDefFoundError exceptio

Re: Detecting if a file type is supported or not

2016-02-05 Thread Nick Burch
On Fri, 5 Feb 2016, Steven White wrote: How do I detect if a file type is supported or not? Run the detection only. If you get anything other than application/octet-stream back, Tika was able to detect it Also, how do I detect if a file type is supported but it cannot be processed because t

Re: Using Tika that comes with Solr 5.2

2016-02-05 Thread Nick Burch
On Fri, 5 Feb 2016, Steven White wrote: I went over to Tika's home page and tried to figure out what are the JARs I need (so I don't have to use Tika's JARs that come with Solr). I looked around and couldn't find a "dist" of the JARs. There isn't one - it's expected that you'll be using Maven

Re: RTF exception

2016-02-03 Thread Nick Burch
On Wed, 3 Feb 2016, Andrea Asta wrote: I'm having an exception when converting a RTF document with the standard new Tika().parseToString(). Any chance you could open a new bug in jira, and attach the smallest RTF file you have which shows up the problem? Nick

RE: Using Tika that comes with Solr 5.2

2016-02-03 Thread Nick Burch
On Wed, 3 Feb 2016, Uwe Schindler wrote: The reason for this behaviour is part of TIKA: If a parser cannot load because of classes it refers to are missing, it is automatically disabled. Because you missed the actual PDF/Powerpoint/… classes, this is what happens for all those parsers. I wond

Re: Using Tika that comes with Solr 5.2

2016-02-03 Thread Nick Burch
On Tue, 2 Feb 2016, Steven White wrote: What I'm finding is that Tika will not extract the raw text off PDF, Powerpoint, etc. files but it will off raw text files. I'd suggest you try some of the steps in the troubleshooting page: http://wiki.apache.org/tika/Troubleshooting%20Tika Probably st

Re: Fwd: Issues adding custom content-type

2016-02-02 Thread Nick Burch
On Tue, 2 Feb 2016, James Brooking wrote: I tried to add a classpath attribute but that didn't seem to change anything: java -classpath "." -jar tika-server-1.11.jar -h 0.0.0.0 The -jar and -classpath options are sadly mutually incompatible Try with: -classpath .:tika-server-1.11.jar org.apa

Re: Fwd: Issues adding custom content-type

2016-02-02 Thread Nick Burch
On Tue, 2 Feb 2016, James Brooking wrote: I created a custom content-type like so: application/hello application/hello This was saved into file called parsers.xml. That's not a custom mime type / content type file, that seems to be a custom Tika XML file. You

Re: Detection problem with RFC822 file with HTML content

2015-11-13 Thread Nick Burch
On Fri, 13 Nov 2015, Vjeran Marcinko wrote: On 13.11.2015 11:51, Nick Burch wrote: On Fri, 13 Nov 2015, Vjeran Marcinko wrote: I saved 2 .eml files saved by my Thunderbird, and one of them contained plain text content, whereas other one rich HTML content. Did you try with the latest version

Re: Detection problem with RFC822 file with HTML content

2015-11-13 Thread Nick Burch
On Fri, 13 Nov 2015, Vjeran Marcinko wrote: I saved 2 .eml files saved by my Thunderbird, and one of them contained plain text content, whereas other one rich HTML content. Did you try with the latest version of Apache Tika? IIRC we did some fixes around this moderately recently Nick

Re: Help needed for special byte collecting input stream

2015-10-19 Thread Nick Burch
On Sun, 18 Oct 2015, Vjeran Marcinko wrote: Well, the problem is that I don't need to collect raw content of every possible file type, just some predefined file types. And some parsed files can be very large, like some big archives, and I don't want to collect these raw bytes for such file

Re: Tika unable to extract PDF Text

2015-10-15 Thread Nick Burch
On Thu, 15 Oct 2015, Adam Retter wrote: However, java -Dtika.config=/tmp/tika-config.xml -cp /Users/aretter/Downloads/tika-core-1.10.jar:/Users/aretter/Downloads/tika-parsers-1.10.jar:/Users/aretter/Downloads/pdfbox-2.0.0-20151014.234027-1764.jar:/Users/aretter/Downloads/fontbox-2.0.0-20151014.2

Re: Tika unable to extract PDF Text

2015-10-15 Thread Nick Burch
On Thu, 15 Oct 2015, Adam Retter wrote: java -cp /Users/aretter/Downloads/tika-core-1.10.jar:/Users/aretter/Downloads/tika-parsers-1.10.jar:/Users/aretter/Downloads/pdfbox-1.8.10.jar ExtractTest You probably need fontbox and jempbox as well. Ask maven nicely and it'll tell you what the depende

Re: AutoDetectParser bug?

2015-10-14 Thread Nick Burch
On Wed, 14 Oct 2015, Ziqi Zhang wrote: As for bugzilla, I was unable to create a new bug, as it is saying “first you must pick a product…” and there is no tika in the list. Sorry, wrong project - POI uses Bugzilla, Tika uses JIRA, I wasn't paying enough attention! The starting point for repo

Re: Fwd: AutoDetectParser bug?

2015-10-14 Thread Nick Burch
On Wed, 14 Oct 2015, Ziqi Zhang wrote: My apologies, here are the testing files attached. Any chance you could open a bug in bugzilla, and attach these files there? At first glance, it looks like those files have some certain text patterns near the start which is causing them to be mis-detect

Re: tika for touchstone file format

2015-10-08 Thread Nick Burch
On 06/10/15 12:36, Eva Schlauch wrote: Thanks for the nice introduction to apache tika at the Budapest apache:big_data conference! Glad to see we've inspired you to join the community :) I am considering to use apache tika together with apache oodt. Some of the files that are generated here (

Re: whitelist/blacklist

2015-09-22 Thread Nick Burch
On Mon, 21 Sep 2015, Brian Young wrote: I originally wanted to avoid having to specify a new config because I thought that supplying my own tika XML config meant that I had to redefine everything that would be in the default file. However after some testing it appears that, as in your example,

Re: whitelist/blacklist

2015-09-21 Thread Nick Burch
On Mon, 21 Sep 2015, Brian Young wrote: Hello, we are long time Tika users that have recently started using Tesseract. We would like to be able to enable/disable Tesseract per extraction with Tesseract disabled until we choose to enable it. The easiest way would be to have two different TikaC

Re: Can I add custom detector to be called last to parse common containers' subtypes?

2015-09-01 Thread Nick Burch
On Thu, 27 Aug 2015, Mikhail Titov wrote: On Wed, Aug 26, 2015 at 6:11 AM, Nick Burch wrote: You probably shouldn't be defining additional mimetypes to DefaultParser. I had an impression that indeed there should be no explicit definition and new types should be hooked up to a de

Re: Does tika support "HWP"?

2015-09-01 Thread Nick Burch
On Tue, 1 Sep 2015, Mungeol Heo wrote: java -jar tika-app-1.10.jar --list-supported-types | grep hwp application/x-hwp That means the mime type has been defined in some way java -jar tika-app-1.10.jar --detect sample.hwp application/x-tika-msoffice That means that the HWP file is based on t

Re: TikaConfig with constructor args

2015-08-27 Thread Nick Burch
On Thu, 27 Aug 2015, Andrea Asta wrote: This parser needs some configuration to init an external connections. Is there a way to specify the constructor params (or bean properties to set) in the Tika xml format? Nope, not yet. See the thread "Configuring parsers and translators" for some discu

Re: Can I add custom detector to be called last to parse common containers' subtypes?

2015-08-26 Thread Nick Burch
On Tue, 25 Aug 2015, Mikhail Titov wrote: This way no decorators are involved and I can go back till Tika 1.6 so hopefully I would be able to drop my jar into Alfresco as is. Thank you for bearing with me, Nick! Alfresco needs a very old version of ASM, so take care when upgrading Tika Nick

Re: Can I add custom detector to be called last to parse common containers' subtypes?

2015-08-26 Thread Nick Burch
On Tue, 25 Aug 2015, Mikhail Titov wrote: The following will break automatic parser calling for text/toa5 in 1.10 but not in 1.9: [tika config xml, contents not preserved] You probably shouldn't be defining additional

Re: Can I add custom detector to be called last to parse common containers' subtypes?

2015-08-25 Thread Nick Burch
On Mon, 24 Aug 2015, Mikhail Titov wrote: On Mon, Aug 24, 2015 at 6:14 PM, Mikhail Titov wrote: While writing a reply, I came to a conclusion that in my particular case I can move all "detection" into a parser code and wrap standard parsers. Is parser decorator the way to go if I want to di

Re: Can I add custom detector to be called last to parse common containers' subtypes?

2015-08-25 Thread Nick Burch
On Mon, 24 Aug 2015, Mikhail Titov wrote: While writing a reply, I came to a conclusion that in my particular case I can move all "detection" into a parser code and wrap standard parsers. I feel like nothing prevents me from changing a content type in metadata from parser code if I really want

Re: Can I add custom detector to be called last to parse common containers' subtypes?

2015-08-24 Thread Nick Burch
On Mon, 24 Aug 2015, Mikhail Titov wrote: This approach however works partly for me. It occurred to me that *detect* method in *CompositeDetector* does not pass-in current *type* while trying detectors and testing for specialization[1]. Thus there is no way for me to know whether detector is be

Re: want to disable tesseract ocr parser

2015-08-20 Thread Nick Burch
On 20/08/15 07:19, Sergey Tsalkov wrote: Then I thought I could pass a custom config.xml to disable it, but I can't figure out how to write the config file. See http://tika.apache.org/1.10/configuring.html#Configuring_Parsers for details of the parser configuration You should be fine with a
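
For readers who land here without the rest of the reply: the parser configuration page linked above describes a `parser-exclude` element, and a config along the following lines is one way to keep every default parser except Tesseract. This is a sketch based on that documentation rather than the text of the truncated reply; the file name and location are up to you.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch: keep the default parser set, but leave out the Tesseract OCR parser. -->
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
  </parsers>
</properties>
```

Only the parsers section is overridden here; anything not mentioned in the file should fall back to the defaults.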

Re: How to configure OutlookPSTParser

2015-08-13 Thread Nick Burch
On Fri, 14 Aug 2015, Justin wrote: I replaced: [XML not preserved] with: [XML not preserved] ...and this works. So I think there's a problem in that CompositeDetector does not behave like DefaultDetector with the same set of detectors. Are you able to write a short junit test case which, when coupled wi

Re: How to configure OutlookPSTParser

2015-08-12 Thread Nick Burch
On 12/08/15 02:07, Justin wrote: ---tika-config.xml--- I do not get anything back from BodyContentHandler when parsing a PST file whereas I do when I use TikaConfig.getDefaultConfig() instead. Am I missing something? Your config file lo

Re: Using Tika for AutoCad .dwg files

2015-08-07 Thread Nick Burch
On Fri, 7 Aug 2015, BOEY Lionel wrote: I recently downloaded and tested the Tika library v1.11 to parse AutoCad .dwg files... Do you mean 1.11-snapshot? Latest is 1.10, and that hasn't hit the mirrors yet Everything runs smoothly but the handler return almost zero data (save for the title o

Re: Can I add custom detector to be called last to parse common containers' subtypes?

2015-08-01 Thread Nick Burch
On Sat, 1 Aug 2015, Nick Burch wrote: If you need full control over the ordering, for now, you need to write some code something like: DefaultDetector d1 = new DefaultDetector(); MyCustomDetector d2 = new MyCustomDetector(); CompositeDetector detecter = new CompositeDetector(d1,d2); Then

Re: Can I add custom detector to be called last to parse common containers' subtypes?

2015-08-01 Thread Nick Burch
On Fri, 31 Jul 2015, Mikhail Titov wrote: Can I force my detectors declared in META-INF/services/org.apache.tika.detect.Detector to go after standard ones? Nope, sorry Using the service loader method, you're basically saying to Tika "make my life easy with the defaults" If you need full con

Re: Overriding built-in parser for TikaCLI with Tika 1.9

2015-07-30 Thread Nick Burch
On Thu, 30 Jul 2015, Stephan Mühlstrasser wrote: Now I tested with Tika 1.9 There was a bug in 1.9 with the parser service loader list logic, can you try with a nightly build? And is there any documentation on the syntax of the Tika XML configuration file? I was able to find some examples o

Re: TesseractOCRParser on Linux

2015-07-23 Thread Nick Burch
On Wed, 22 Jul 2015, Christian Wolfe wrote: It looks to me that TesseractOCRParser doesn't work on Linux unless the Tesseract executable and the 'tessdata' folder are in the same location on the filesystem. This makes sense in a Windows environment (where everything is installed together by def

Re: Per Page Document Content

2015-07-19 Thread Nick Burch
On Fri, 17 Jul 2015, Nazar Hussain wrote: I tested it with different pdf documents and it works 100% perfect. Unfortunately it does not work well with docx format. That's entirely to be expected. Word .doc and .docx formats are run-based not page-based, so there's no page information in the fi

Re: Licensing of Tika

2015-07-16 Thread Nick Burch
On Wed, 15 Jul 2015, Ingo Wiarda wrote: generating a list of all licenses is a good idea. The last thing you want for your product is to discover that the most recent version of a dependency is AGPL'ed, if you plan to publish under another license. I think there are some maven plugins you can

Re: Per Page Document Content

2015-07-15 Thread Nick Burch
On Wed, 15 Jul 2015, Nazar Hussain wrote: @Matt. I am looking for plain text extraction, no css or xpath. I just want to extract text per page. So I would have array of plain text content on which each index have content of a single page. You won't be able to do it in the plain-text space. You
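
The reply is cut off here, so the following is only one possible route, not necessarily the one it goes on to describe. If the parser's XHTML output wraps each page in `<div class="page">` (an assumption; to my knowledge the PDF parser does this, but check your own output), a plain SAX handler can collect the text of each page separately. `PageSplitter` is a hypothetical name, and the demo feeds it a hand-written sample through the JDK's SAX parser instead of a real Tika parse.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: collects the text inside each div class="page". */
public class PageSplitter extends DefaultHandler {
    private final List<String> pages = new ArrayList<>();
    private StringBuilder current;   // null while outside any page div
    private int depth;               // div nesting depth inside the current page

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // localName is empty when the SAX source is not namespace-aware
        String name = localName.isEmpty() ? qName : localName;
        if (!"div".equals(name)) {
            return;
        }
        if (current == null && "page".equals(atts.getValue("class"))) {
            current = new StringBuilder();
            depth = 1;
        } else if (current != null) {
            depth++;                 // nested div inside a page
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String name = localName.isEmpty() ? qName : localName;
        if (current != null && "div".equals(name) && --depth == 0) {
            pages.add(current.toString().trim());
            current = null;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    public List<String> getPages() {
        return pages;
    }

    /** Demo helper: runs the handler over an XHTML string with the JDK SAX parser. */
    public static List<String> split(String xhtml) throws Exception {
        PageSplitter handler = new PageSplitter();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getPages();
    }
}
```

Because the handler only depends on `org.xml.sax`, the same object could be handed to any API that accepts a SAX `ContentHandler`. It will return nothing for formats whose output carries no page markers, which matches the point made elsewhere in this thread about .doc and .docx.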

Re: Per Page Document Content

2015-07-15 Thread Nick Burch
On Wed, 15 Jul 2015, Nazar Hussain wrote: Yes in first phase I am targeting PDF and DOC files. Later will use PPT and other but all would be page based documents. .doc is not a page based format, it's a run-based format. There is no page information in the file format, it's calculated on the f

Re: Licensing of Tika

2015-07-15 Thread Nick Burch
On Tue, 14 Jul 2015, Chris Harshman wrote: Personally, I'd conduct a review of each component if license compliance is important to you (e.g., if you're going to release a commercial product incorporating the code). While Apache tries to ensure the software it produces is "commercially friend

Re: Per Page Document Content

2015-07-15 Thread Nick Burch
On Wed, 15 Jul 2015, Nazar Hussain wrote: The problem I am facing is with pages. I can extract total pages from document metadata. But I can't find any way to extract content per page from the document. What file formats is this for? And how are you calling Tika? If the file format is page-ba

Re: Configuring Logging

2015-07-10 Thread Nick Burch
On Fri, 10 Jul 2015, Gabriele Lanaro wrote: Hi, I would like to know if it is possible to configure logging in tika, for example I'd like to log to a file instead of standard output. Redirection is not an option because I'm using tika as a library for another application, and I'd like to log ot

Re: MagicDetector does not enforce mark/reset support in inputstream

2015-06-18 Thread Nick Burch
On Thu, 18 Jun 2015, Satya Deep Maheshwari wrote: Please see [1] which comes into play when detecting the mime-type from content. I think there is an assumption in Tika's MagicDetector that the stream would always support mark/reset. Probably it should check it explicitly and not proceed if tha
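
On the caller's side, the mark/reset assumption can be met with the JDK alone. The sketch below is not Tika code: `MarkResetPeek` and its methods are hypothetical names, and it only illustrates the explicit check the poster suggests, plus the usual workaround of wrapping the stream in a `BufferedInputStream`.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper showing an explicit mark/reset check before peeking at a stream. */
public class MarkResetPeek {

    /** Returns the stream itself if it supports mark/reset, otherwise a buffered wrapper. */
    public static InputStream ensureMarkSupported(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    /** Reads up to n leading bytes, then rewinds so the caller still sees the whole stream. */
    public static byte[] peek(InputStream in, int n) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(n);
        byte[] buf = new byte[n];
        int total = 0;
        try {
            while (total < n) {
                int read = in.read(buf, total, n - total);
                if (read < 0) {
                    break;           // stream is shorter than n bytes
                }
                total += read;
            }
        } finally {
            in.reset();
        }
        return Arrays.copyOf(buf, total);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "%PDF-1.4 rest of file".getBytes(StandardCharsets.US_ASCII);
        InputStream in = ensureMarkSupported(new ByteArrayInputStream(data));
        System.out.println(new String(peek(in, 4), StandardCharsets.US_ASCII));
    }
}
```

Calling `ensureMarkSupported` before handing a stream to any content-based detection keeps the check in one place and leaves streams that already support mark/reset untouched.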

Re: Tika-Translate, Clojure, and a new Constructor

2015-05-28 Thread Nick Burch
On Wed, 27 May 2015, Carter, Phillip Michael wrote: I propose expanding the API for the Microsoft and Google Translators to accept configuration details as parameters. This eases development for the iPReS project by allowing us to be responsible for our own configuration details, and allows th

Re: Dynamic content handler

2015-05-26 Thread Nick Burch
On Tue, 19 May 2015, Andrea Asta wrote: I would implement the following scenario: - For HTML pages with a given URL Pattern, extract a part of the page starting from an XPath - For other generic HTML pages I would use Boilerpipe - For different file formats, a simple BodyContentHandler is ok Wh

Re:Re: How to skip encrypted (PDF) documents

2015-05-20 Thread Nick Burch
On Wed, 20 May 2015, YukaChan wrote: btw, I receive the "Java heap space" error when processing one of the doc files, the size of this file is not large, nor does it contain too much content, how can I avoid similar errors? ForkParser can be used to avoid an OOM or similar error from wiping ou

Re: How to skip encrypted (PDF) documents

2015-05-20 Thread Nick Burch
On Wed, 20 May 2015, YukaChan wrote: Currently I am processing a bunch of documents such as MS Office and PDF files, I intend to extract only text out of every document for further analysis. When Tika meets an encrypted document it is stuck and the whole extraction is aborted. Actually, it i

Re: How to customize AutoDetectParser without changing the distribution of Tika

2015-05-19 Thread Nick Burch
On Tue, 19 May 2015, Andrea Asta wrote: Can you please give me more details about the service file? See https://tika.apache.org/1.8/parser_guide.html#List_the_new_parser Nick

Re: [Date Format] Render dates in single format

2015-05-19 Thread Nick Burch
On Tue, 19 May 2015, Alessandro Benedetti wrote: Is there any way to know which metadata are Dates or not ? You can find out if a given property is a date or not. Most of the entries on the Metadata object will these days be properties, as we've been trying to convert Parsers to use typed Pro

Re: [Date Format] Render dates in single format

2015-05-19 Thread Nick Burch
On Mon, 18 May 2015, Alessandro Benedetti wrote: I am interested in understanding if there is any config param in Tika to force the rendering of all dates in a specific format. Independently of the parser. Nope, you'll need to do it on the output side. The parsers will store the dates / date
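
Doing it on the output side can be as small as the sketch below. It assumes the metadata value arrives as an ISO 8601 string with an offset (for example `2015-05-19T12:15:30+02:00`); that is an assumption about your data, and `DateRenderer` plus the output pattern are hypothetical choices, not anything Tika provides.

```java
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

/** Hypothetical output-side helper: re-renders ISO 8601 date strings in one display format. */
public class DateRenderer {
    // The single output format applied to every date value (pattern chosen for illustration)
    private static final DateTimeFormatter OUT =
            DateTimeFormatter.ofPattern("dd MMM yyyy HH:mm", Locale.ENGLISH);

    /** Parses an ISO 8601 value with an offset or 'Z' and renders it in UTC. */
    public static String render(String isoValue) {
        OffsetDateTime parsed = OffsetDateTime.parse(isoValue);
        return parsed.withOffsetSameInstant(ZoneOffset.UTC).format(OUT);
    }

    public static void main(String[] args) {
        System.out.println(render("2015-05-19T12:15:30+02:00"));
    }
}
```

Values that carry no offset would fail in `OffsetDateTime.parse` and need `LocalDateTime.parse` instead, so a real implementation would want to handle both shapes.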

Re: How to customize AutoDetectParser without changing the distribution of Tika

2015-05-19 Thread Nick Burch
On Tue, 19 May 2015, Andrea Asta wrote: I was wondering if I could customize the AutoDetectParser without changing the Tika jar files. Just add your own parser to the classpath along with a service file I am following the Parser 5 min quick start but can't figure out where to add my new Parse

Re: Tika - tessract integration

2015-03-24 Thread Nick Burch
On Tue, 24 Mar 2015, Kovalan R wrote: I would like to know whether Tika parser and Tesseract can be integrated together? Yes. You'll need to be using Tika 1.7, and you'll need to have Tesseract on your path. https://wiki.apache.org/tika/TikaOCR has more details Nick

Tika at ApacheCon in Austin, 13-16 April

2015-03-05 Thread Nick Burch
Hi All As many of you will hopefully know, the next ApacheCon takes place in April in Austin, Texas. What you may not know is quite how many Tika related talks we have taking place. If you take a look at the schedule, you'll discover there are 6 different talks on or related to Apache Tika t

Re: LanguageIdentifier.isReasonablyCertain is always false

2015-03-04 Thread Nick Burch
On Wed, 4 Mar 2015, Wilm Schumacher wrote: I want to use the language detector for choosing the stemming in my full text search engine. My plan was to use the specific stemmer (e.g. "german2") if getLanguage returns "de". However, as getLanguage always returns something, e.g. "lt" for the conte

Re: Config for Tika Windows Service with Apache Commons Daemon

2015-03-03 Thread Nick Burch
On Wed, 4 Mar 2015, Jason wrote: I can get Tika started as a service, but I can't determine what to use for a stop method. There isn't really a stop method. As it stands, the Tika Server runs in a single process, started from the main method. To close it down, send it control+c or a kill sign

Re: File extension for application/gzip

2015-02-28 Thread Nick Burch
On Sat, 28 Feb 2015, Adam Lamar wrote: I'd appreciate a change of the default! Best bet would be to open a jira for this, then the change can be tracked and will have a jira id Every tgz is application/gzip, but not every application/gzip is a tgz. Also, it seems to me that the parsers shou

Re: File extension for application/gzip

2015-02-28 Thread Nick Burch
On Sat, 28 Feb 2015, Adam Lamar wrote: MimeType mimeType = config.getMimeRepository().forName("application/gzip"); When I call mimeType.getExtension(), the returned value is ".tgz". That mime type has multiple extensions defined, with .tgz just the first. I wonder if it would be worth changin

Re: Add custom mime type programmatically

2015-02-09 Thread Nick Burch
On Sun, 8 Feb 2015, silvercha...@fastmail.com wrote: I'm trying to add a custom mime type. I've seen solutions that involve writing a custom-mimetypes.xml file, but I'd really prefer to add my custom type programmatically. Currently, I think we only support loading magic from a combination of

Re: Advice on parsing Spreadsheets and preserving cell positions

2015-02-04 Thread Nick Burch
On Wed, 4 Feb 2015, Matt Bachmann wrote: When I play with the TIKA jar file with a simple excel file I get something like what I have below. Code I write to do the parsing pulls out something similar. The data is generally correct. But, in the parsing the position of cells is completely lost.

Re: Question .. how NOT to skip empty paragraphs

2015-01-30 Thread Nick Burch
On Thu, 29 Jan 2015, lgilardon...@gmail.com wrote: Extracting plain text from word this empty paragraphs are completely removed (albeit they stay in the xhtml representation). Any suggestion for preserving this empty paragraphs - in the extracted string they would appear as double \n\n - witho

Re: Unable to extract body of emails from Outlook PST file using OutlookPSTParser in SOLR

2015-01-15 Thread Nick Burch
On Wed, 14 Jan 2015, Anton Shokhrin wrote: I’ve setup my SOLR instance to index Outlook PST files with OutlookPSTParser (via SOLR’s TikaEntityProcessor). I can see that the SOLR is receiving and indexing email message's related meta data like message unique id and subject but the body of the me

Re: External custom-mimetypes.xml

2015-01-13 Thread Nick Burch
On Tue, 13 Jan 2015, Luís Filipe Nassif wrote: And Nick, any alternative to this package like path? I have used it before but I think it is a bit unfriendly to be edited by the final user. It needs to be something unique like that, so we don't get accidental false positives. For production

Re: External custom-mimetypes.xml

2015-01-13 Thread Nick Burch
On Tue, 13 Jan 2015, Luís Filipe Nassif wrote: I would like to load a custom-mimetypes.xml file from a directory, not from the jar files. Is it possible? Yup, I do it quite often when testing Just make sure that you have a directory structure like: somewhere/org/apache/tika/mime/custom-mim

Re: Parsing PDF files

2014-12-24 Thread Nick Burch
On Wed, 24 Dec 2014, A.M. Sabuncu wrote: I am following the examples at http://wiki.apache.org/tika/TikaJAXRS and using the following curl command to test text extraction from PDF files: curl -X PUT -d @GeoSPARQL.pdf http://localhost:9998/tika --header "Content-type: application/pdf" What happ

Re: Tika 2.0???

2014-12-18 Thread Nick Burch
On Thu, 18 Dec 2014, Allison, Timothy B. wrote: I feel Tika 2.0 coming up soon (well, April-ish?!) and the breaking of some other areas of back compat, esp. parser class loading -> config ... Is it worth creating a wiki page, and use that to track things we want to break compatibility with + ba

Re: Setting tesseract properties when using tika-server

2014-11-16 Thread Nick Burch
On Sat, 15 Nov 2014, David Meikle wrote: The OP is using the Tika Server though. I guess we'd need to allow for an extra header in the server to get this set on the context used in the server's parsing? We could do something like this to allow users to set the language per request - I am usin

Re: Setting tesseract properties when using tika-server

2014-11-15 Thread Nick Burch
On Sat, 15 Nov 2014, David Meikle wrote: How can i do that? You can set this using the TesseractOCRConfig class. It has a property called language which can be set to a + separated list of supported language models (i.e. the ones you have installed with your Tesseract installation) using th

Re: Setting tesseract properties when using tika-server

2014-10-30 Thread Nick Burch
On Thu, 30 Oct 2014, Milos Kovacevic wrote: I am using tika-server-1.7-SNAPSHOT.jar which incorporates tesseract ocr engine. I am curious how can i set different tesseract parameters such as default language or output format (hOCR) in a separate request to tika server? I believe they can only b

Re: How to add Parser to existing DefaultParser object

2014-10-24 Thread Nick Burch
Did you try following the approach on the "5 minute new parser guide" on the website? https://tika.apache.org/1.6/parser_guide.html Nick On Fri, 24 Oct 2014, Karol Abramczyk wrote: Hello, I’m using Apache Tika to parse different documents formats in my application. I created custom CSV Pars

Re: Tika 1.6 update in Maven Central?

2014-10-21 Thread Nick Burch
On Tue, 21 Oct 2014, Aeham Abushwashi wrote: I'm trying to determine whether I can wait for a 1.7 release. If not, I think my only option to avoid the uncontrolled build up of tmp files (when processing .7z archives) would be to go back to 1.5. What about using a nightly build / building it yo

Re:Re: proceed with the limitation of character length

2014-10-14 Thread Nick Burch
On Tue, 14 Oct 2014, imyuka wrote: I suppose I'm calling Tika with parse+content handler, the following code is one example I found on the internet: ContentHandler handler = new BodyContentHandler(); If you look at the JavaDocs for BodyContentHandler: https://tika.apache.org/1.6/api/org/apach

Re: External parser

2014-10-14 Thread Nick Burch
On Tue, 14 Oct 2014, Kamil Żyta wrote: You'd basically need to do something like java -classpath tika-app.jar:. org.apache.tika.cli.TikaCLI http://pastebin.com/wSgwFva3 Key there is Can you try asking the Tika CLI what it detects your file as, and what parsers it thinks are present? N

Re: External parser

2014-10-14 Thread Nick Burch
On Tue, 14 Oct 2014, Kamil Żyta wrote: project conventions I'm not familiar with java. So I need to create org/apache/tika/parser/external/ dir and copy there tika-external-parsers.xml? + add that to your classpath, so java finds it. (Long term putting it in a jar might be best) tika-cor

Re: External parser

2014-10-14 Thread Nick Burch
On Tue, 14 Oct 2014, Kamil Żyta wrote: On Tue, Oct 14, 2014 at 12:32:52PM +0100, Nick Burch wrote: On Tue, 14 Oct 2014, Kamil Żyta wrote: All you should need to do is provide a tika-external-parsers.xml file on your classpath (in the appropriate directory), which defines how to talk to your

Re: External parser

2014-10-14 Thread Nick Burch
On Tue, 14 Oct 2014, Kamil Żyta wrote: I want to use external parser but on web there isn't complex howto/tutorial. I only found parser/external/tika-external-parsers.xml sample configuration but I don't know how to register/enable this parser in tika parsers. All you should need to do is pro

Re: proceed with the limitation of character length

2014-10-14 Thread Nick Burch
On Tue, 14 Oct 2014, imyuka wrote: In these cases, how can I increase the limit or retrieve only the first 10 characters of the document without throwing an exception? Depends largely on how you're calling Apache Tika? (The answer differs depending on if you're using the Tika CLI (app), Ti

Re:Re: Formatted Content Extraction and Title Detection

2014-10-09 Thread Nick Burch
On Thu, 9 Oct 2014, imyuka wrote: I roughly checked up the book and found the instruction about transforming a document to a XHTML file with command line, while I have no idea about the Java coding implementation. Are there any instructions or tutorials I can refer to? We have quite a few ex

Re: Formatted Content Extraction and Title Detection

2014-10-09 Thread Nick Burch
On Thu, 9 Oct 2014, imyuka wrote: Here is my problem: I have extracted plain texts from a series of doc(x) documents and their titles via the "dc:title" label of metadata, but I'm not sure this is the right way to attain a title of a document. In many cases, a title inside a document could be

Re: Customizing Metadata Keys

2014-10-09 Thread Nick Burch
On Wed, 8 Oct 2014, Can Duruk wrote: My question is regarding setting the metadata keys coming from the parsers to my own keys. For my application, I am using Tika to extract the metadata for a bunch of files. I am using the embedded HTTP server which I modified for my needs to return instead of

Tika at ApacheCon Europe - 2 months time!

2014-09-22 Thread Nick Burch
Hi All It's only 2 months to go until ApacheCon Europe in Budapest. I'm simultaneously excited by all the great Tika stuff going on, and worried by how many talks I need to finish writing... As usual for an ApacheCon, we've a number of talks about Tika going on, and almost certainly a hacka

Re: very large xml-file parsing

2014-09-13 Thread Nick Burch
On Sat, 13 Sep 2014, Mugat Gurkowsky wrote: I am trying to use Tika in combination with Lucene to parse and index very large XML files. So far, without success, because of memory limitations. Tika's BodyContentHandler seems to try to copy the whole content in memory, which doesn't work as fi
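
The memory problem described above can be illustrated with a minimal sketch, under stated assumptions: instead of one handler that buffers the whole document, a custom SAX handler hands each record to a consumer (for example an indexing step) as soon as its closing tag arrives, so only one record is held in memory at a time. RecordStreamingHandler, the records helper and the "record" element name are hypothetical; the JDK SAX parser is used directly so the example runs without Tika or Lucene.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: emits the text of each record element, one record at a time. */
final class RecordStreamingHandler extends DefaultHandler {
    private final String recordElement;
    private final Consumer<String> sink;
    private final StringBuilder current = new StringBuilder();
    private int depth = 0; // greater than 0 while inside a record element

    RecordStreamingHandler(String recordElement, Consumer<String> sink) {
        this.recordElement = recordElement;
        this.sink = sink;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (depth > 0) {
            depth++; // nested element inside the current record
        } else if (qName.equals(recordElement)) {
            depth = 1; // a new record starts
            current.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth > 0) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (depth > 0 && --depth == 0) {
            sink.accept(current.toString()); // hand the finished record on
            current.setLength(0);            // and release its buffer
        }
    }

    /** Convenience for the demo only: collects records into a list. */
    static List<String> records(String xml, String recordElement) throws Exception {
        List<String> out = new ArrayList<>();
        SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new RecordStreamingHandler(recordElement, out::add));
        return out;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><record>a<b>b</b></record><skip>x</skip><record>c</record></root>";
        records(xml, "record").forEach(System.out::println);
    }
}
```

In a real pipeline the sink would index each record directly instead of collecting into a list, since collecting everything would bring the memory problem back.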

Re: Tika versions compatibility

2014-09-02 Thread Nick Burch
On Tue, 2 Sep 2014, Baldwin, David wrote: Is there any information I have not found googling around and searching the page that may show any changes from 0.6 to the current 1.5 version that may make it incompatible on the API/Usage level? We've strived to maintain backwards compatibility, so y

Re: Compression of Tika server output files

2014-08-07 Thread Nick Burch
On Thu, 7 Aug 2014, Bratislav Stojanovic wrote: This was exactly what I was afraid of...you see, I have to extract thousands and thousands of documents and calling java command *three times* for each of them is highly inefficient. The Tika App is largely intended for testing, debugging, demos

Re: Compression of Tika server output files

2014-08-07 Thread Nick Burch
On Thu, 7 Aug 2014, Bratislav Stojanovic wrote: Hmm, I apologize, but I'm afraid this does not work. If you specify : *java -jar tika-app-1.5-SNAPSHOT.jar --text --metadata --extract --extract-dir=out example.doc* ...it will only extract attachments, not everything (text + meta + attachments).

Re: Compression of Tika server output files

2014-08-07 Thread Nick Burch
On Thu, 7 Aug 2014, Bratislav Stojanovic wrote: OK, but I don't really have to use http...does tika support extracting all resources in one call by some other method? The Tika App does - the -z / --extract will do that. You might also want to use the --extract-dir= flag to set where they go

Re: Running tika-server in background?

2014-08-07 Thread Nick Burch
On Thu, 7 Aug 2014, Bratislav Stojanovic wrote: Is there any way to run tika-server in background? I want to start this command and immediately continue execution : *java -jar target\tika-server-1.5-SNAPSHOT.jar* I've tried with adding & at the end, but no luck (Win7 x64 command prompt). Thi

Re: Compression of Tika server output files

2014-08-07 Thread Nick Burch
On Thu, 7 Aug 2014, Bratislav Stojanovic wrote: Yes, GZIP compression will do the job for me...but having plain folders and files as output is even better. How complicated is it to update/add an option in the Tika source to output folders and files directly without packing them into any file format?

Re: AW: AW: Determine binary pdf?

2014-07-23 Thread Nick Burch
On Wed, 23 Jul 2014, Clemens Wyss DEV wrote: Is it possible to tell Tika to extract only the first n pages? Not currently. Very few file formats (PDF being one of the few) actually encode the page info, so for most formats Tika supports there's no way to restrict by page count. Nick

Re: How to identify a language of text

2014-07-23 Thread Nick Burch
On Wed, 23 Jul 2014, Avi Hayun wrote: How many languages does Tika support? You can see the list of supported languages in svn: http://svn.apache.org/repos/asf/tika/trunk/tika-core/src/main/resources/org/apache/tika/language Where can I find more information about it ? There's a tiny bit at

Re: AW: Determine binary pdf?

2014-07-22 Thread Nick Burch
On Tue, 22 Jul 2014, Clemens Wyss DEV wrote: I have thousands of PDFs that are extracted using Tika and then indexed/analyzed in Lucene. And there seems to be "cryptic" text (binary data?) in some of the PDFs. Are you able to identify a small pdf (ideally sub 100kb) which shows the problem?

Re: Avoiding Out of Memory Errors

2014-07-18 Thread Nick Burch
On Thu, 17 Jul 2014, Shannon Brown wrote: Problem: How to avoid Out of Memory errors during Tika parsing. Typical approaches are either to use the ForkParser, or the Tika Server. Both ensure that if there's a fatal problem with parsing (eg OOM) then the JVM with your main application in it do
