Re: Unexpected behavior when inspecting mp4 files with different ISO

2024-04-27 Thread Nick Burch

On Fri, 26 Apr 2024, Mauler, David wrote:
I'm in the process of troubleshooting an issue with certain mp4 video 
files and tika. After a bunch of digging, it appears to be related to 
whatever ISO is set for the mp4 file. An mp4 with an ISO of 
14496-12:2003 will be detected as video/quicktime, but an mp4 with an ISO 
of 14496-14 is detected as video/mp4, which is what I was expecting for 
both files.


Depends where in the file the type box lives. At the moment, we only have 
mime-magic based detection for the Quicktime / MP4 family of formats. If 
the right box in the container is at the start we're OK; if it comes later, 
we can't tell with just a mime magic signature


What we really need is a container-aware detector for the file format, 
similar to what we have for Zip files and for the Ogg family. That would 
properly process the file in a format-aware way, checking the contents 
to correctly identify the type.
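
For reference, the sort of check such a detector would do is to find the 
ftyp box and read its major brand. A minimal sketch, assuming the ftyp box 
comes first (common, but not guaranteed - which is exactly the limitation 
above):

import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class FtypSniffer {
    /** Returns the major brand of a leading ftyp box, or null if the file
     *  doesn't start with one (a real detector would walk the boxes). */
    public static String majorBrand(InputStream in) throws IOException {
        DataInputStream dis = new DataInputStream(in);
        byte[] header = new byte[12];
        dis.readFully(header); // box size (4), box type (4), major brand (4)
        String boxType = new String(header, 4, 4, StandardCharsets.ISO_8859_1);
        if (!"ftyp".equals(boxType)) {
            return null;
        }
        return new String(header, 8, 4, StandardCharsets.ISO_8859_1);
    }
}

A major brand of "qt  " would map to video/quicktime, while brands like 
"isom" or "mp42" map to video/mp4; when ftyp isn't the first box, you keep 
walking the container box-by-box, which is what magic alone can't do.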


The long-standing issue is https://issues.apache.org/jira/browse/TIKA-2935
- do you have a few days of spare coding time you could put towards this, 
and/or a bit of budget to sponsor someone to?


Thanks
Nick


Re: PST file parsing

2023-11-29 Thread Nick Burch

On Wed, 29 Nov 2023, Neha Kamat via user wrote:
We are currently using TIKA for parsing/extracting content from PST 
files. Is there a way we can tell the parsing engine to parse as a list of 
emails instead of a string of emails?


Depends how you're calling Tika?

Tika App? Tika Server? Python Wrapper? Java via the Tika class facade? 
Java direct to parsers? Tika Batch?


Nick


Re: Using Tika with another OCR engine

2023-08-08 Thread Nick Burch

On Thu, 3 Aug 2023, Cristian Zamfir wrote:
I am interested in trying out Tika with a different OCR engine and 
wondering how Tesseract is integrated.


Largely as "just another parser", but IIRC with a bit of logic to allow 
the "normal" image parsers to also have a go at the file to grab metadata


It's all in tika-parser-ocr-module:
https://github.com/apache/tika/tree/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-module


Is it possible to write a plugin to call a different engine?


Largely it would be a case of writing your own parser, registering it for 
the appropriate mime types, and disabling the Tesseract one if you have the 
tesseract binary on your path
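
As a rough sketch, a minimal parser looks something like this (the OCR 
engine call is a stand-in - wire in whatever API your engine actually has):

import java.io.IOException;
import java.io.InputStream;
import java.util.Collections;
import java.util.Set;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.AbstractParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.XHTMLContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class MyOcrParser extends AbstractParser {
    private static final Set<MediaType> TYPES =
            Collections.singleton(MediaType.image("png")); // claim your image types

    @Override
    public Set<MediaType> getSupportedTypes(ParseContext context) {
        return TYPES;
    }

    @Override
    public void parse(InputStream stream, ContentHandler handler,
                      Metadata metadata, ParseContext context)
            throws IOException, SAXException, TikaException {
        String text = runMyOcrEngine(stream); // stand-in for your engine's API
        XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
        xhtml.startDocument();
        xhtml.element("p", text);
        xhtml.endDocument();
    }

    private String runMyOcrEngine(InputStream stream) {
        return ""; // call out to the real engine here
    }
}

Register it by listing the class name in 
META-INF/services/org.apache.tika.parser.Parser on your classpath; a Tika 
Config file can then exclude the Tesseract parser.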


for scanned PDFs, I assume there is some bi-directional communication 
between Tika and Tesseract to detect inline images. Is that correct?


Nope, the PDF parser will detect any embedded resources (eg images), and 
if enabled will call the appropriate parser for each one


Nick


Re: TIKA for MIME type detection

2023-07-27 Thread Nick Burch

On Tue, 20 Jun 2023, Neha Kamat via user wrote:
I am currently working on an application wherein I would like to 
whitelist the filetypes supported by TIKA And discard rest of the files 
to avoid unknown behaviour/memory leaks. I am currently referring to 
https://cwiki.apache.org/confluence/display/TIKA/File+Types+and+Dependencies.


You may be better off using the Tika App or Tika Server options which will 
let you see which mime types each parser claims, which parsers you have 
available, and how the mime types relate to each other (more info 
available via Java API too)


That way you can check exactly what mime types your install supports, how 
they relate to each other, the impact of disabling parsers via the config 
file etc
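
For example, via the Java API, a sketch like this will list every type 
your install's parsers claim, plus each type's parent:

import org.apache.tika.config.TikaConfig;
import org.apache.tika.mime.MediaType;
import org.apache.tika.mime.MediaTypeRegistry;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;

public class ListTypes {
    public static void main(String[] args) throws Exception {
        TikaConfig config = TikaConfig.getDefaultConfig();
        MediaTypeRegistry registry = config.getMediaTypeRegistry();
        AutoDetectParser parser = new AutoDetectParser(config);
        for (MediaType type : parser.getSupportedTypes(new ParseContext())) {
            System.out.println(type + " -> parent " + registry.getSupertype(type));
        }
    }
}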


Nick


Re: Run Tika-docker with custom config

2023-04-28 Thread Nick Burch

On Fri, 28 Apr 2023, שי ברק wrote:
I don’t know if it’s possible but I’m trying to avoid typing this ‘ –– 
config’ when I start the container. I wish to have all of these settings 
to be written inside the Dockerfile.


Since you're doing your own custom docker container, you could override 
the ENTRYPOINT to specify the Tika Config file by default

https://github.com/apache/tika-docker/blob/master/full/Dockerfile#L77
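
eg something along these lines in your own Dockerfile - a sketch, and the 
jar path and flags need to match whatever the base image you build from 
actually uses:

FROM apache/tika:latest-full
# Bake the custom config into the image...
COPY tika-config.xml /tika-config.xml
# ...and override the entrypoint so --config is always passed
ENTRYPOINT ["/bin/sh", "-c", "exec java -jar /tika-server-standard.jar -h 0.0.0.0 --config /tika-config.xml"]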

Nick

Re: Run Tika-docker with custom config

2023-04-28 Thread Nick Burch

On Fri, 28 Apr 2023, שי ברק wrote:

Inside the container probably - makes more sense to me


In that case, create a custom Docker container that adds in your custom 
config to your Docker image, as per Konstantin's instructions:

https://lists.apache.org/thread/l0od2b6tp6odyd661ftjqmkkf27o6hdl

Then when you start your Docker container, tell it the path to your custom 
config file within the container using the --config flag



See also https://github.com/apache/tika-docker#custom-config for how to 
have your config outside on the host machine and to mount that in, in case 
you decide to go the other way


Nick

Re: Tika incorrectly detecting Canon raw image file .cr3 as video/quicktime

2023-03-22 Thread Nick Burch

On Wed, 22 Mar 2023, Tim Allison wrote:

Thank you, Richard, for raising this.  In looking at these file
formats, it looks like crw is based on ciff, cr2 is based on tiff and
cr3 is based on quicktime.


Always fun when the core of a format (or at least the container) swaps 
between versions!



For some file formats we do, application/x-this-app; version=1.0,
application/x-thisapp; version=2.0.  For others, we create separate
main mimes as you've done


A lot of the ";version=x.y" ones are where the structure is pretty 
similar, but you can spot the version from the mime type


A few do have very different structures, and hence parent types. 
image/vnd.dgn is the main one like that


Otherwise, we / the format authors give them whole different mime types 
for the different versions. application/vnd.ms-excel vs 
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet being 
one such example



In the absence of any official mime types, I'd say the "best" will depend 
on whether most programs handle all 3 versions pretty much the same, or if 
most programs only do 1-2 of them and don't handle the others / handle them 
in very different ways



Another option would be to subtype cr2 to crw and cr3 to crw, but then 
add cr2 to a supported format in our TIFFParser and cr3 to our 
mpeg/quicktime parser.


Did we ever get round to adding a quicktime / mp4 detector? Ideally we'd 
have something that checks for some well known atoms, and identifies the 
specific subtype that the container holds from that.




I think we should improve our detection of these at the very least.  I
found some examples for cr2, if we can get examples for crw and cr3,
that'd be helpful.  The dropfiles link isn't working for me at the
moment. :(


I have a friend with a fancy canon camera that ought to be able to 
generate all of these, but he's a bit busy putting on a theatre production 
this week... Will try to get some in a week or two!




Some useful links (I want to document these for me.  You probably
already know them!)
[0] https://exiftool.org/canon_raw.html


Using our various IO utils, looks like writing a parser for the original 
format wouldn't be too much work. If Richard feels like contributing... :)


Nick


Re: Best practice for extracting content and metadata repeatedly

2023-03-06 Thread Nick Burch

On Mon, 6 Mar 2023, Chris Bamford via user wrote:
From both performance and thread safety points of view what is the best 
approach for the use / reuse of the following objects:


Tika
ParseContext
Parser
Metadata


The Tika object and/or TikaConfig object should only be created once and 
then re-used. Same for any Parser or Detector instances


ParseContext ought to be fine to re-use, but it's such a light-weight 
thing I normally create one fresh.


Metadata is normally created from scratch each time - all the entries would 
need to be removed, and recreating is typically a lot less work than 
removing everything
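
So, as a sketch, the shape is roughly:

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class ReuseExample {
    // Created once, shared across calls / threads
    private static final AutoDetectParser PARSER = new AutoDetectParser();

    public static String extract(Path file) throws Exception {
        // Fresh per parse: handler, metadata, context
        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();
        try (InputStream stream = Files.newInputStream(file)) {
            PARSER.parse(stream, handler, metadata, new ParseContext());
        }
        return handler.toString();
    }
}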



Depending on how untrustworthy / variable the input you're receiving is, 
and the impact of an OOM or similar, you might want to look into using the 
Tika Server or Fork Mode or Batch Mode.


Nick


Re: Subset(s) of Tika?

2023-01-05 Thread Nick Burch

On Thu, 5 Jan 2023, Georg.Fischer wrote:
The tika.jar is >54 MB, and I suspect that the loading of the big jar 
(under Windows) is hindering the performance. I should perhaps move to 
Linux, or try the Tika server.


The Tika App jar has always been the "kitchen sink included quickstart" 
option


The Tika java library, and the Tika Server both support including or 
excluding groups of file format parsers


I used a recent tika.jar on the Windows 10 commandline to extract text 
from some 30 PDF files, with a makefile converting one file per command. 
That was quite successful, but it took some time, and the approach will 
perhaps not be appropriate for 300 or 1000 PDFs.


For a folder of files, you might be better off with Tika Batch, which is 
aimed at batch processing a large number of files. It can respawn failed 
child processes, doesn't require starting a JVM for every file etc


Otherwise, the Tika Server is a good option. If you're doing everything 
locally, turn on "-enableUnsecureFeatures -enableFileUrl" and then you can 
pass it a file path to process (but not on a publicly available 
machine!)
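
eg, as a sketch (jar names vary by version):

java -jar tika-app.jar -i /path/to/input-dir -o /path/to/output-dir
java -jar tika-server-standard.jar -enableUnsecureFeatures -enableFileUrl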


Nick


Re: Paragraph words getting merged

2022-10-31 Thread Nick Burch

On Sun, 30 Oct 2022, Christian Ribeaud wrote:
I am using the default configuration. I think we could reduce my 
problem to the following code snippet:


Is there a reason that you aren't using one of the built-in Tika content 
handlers? Generally they should be taking care of everything for you with 
paragraphs, plain text vs html etc


Nick


Re: Custom Parser Plugin for Tika Server

2022-10-26 Thread Nick Burch

On Wed, 26 Oct 2022, Tim Allison wrote:

I've been struggling with this too.  Outside of Docker, what I've been
doing is using a bin/ directory and throwing everything in there and then
starting tika-server: java -cp "bin/*"
org.apache.tika.server.core.cli.TikaServerCli ...

If we moved to that model in our Docker container, then you could start
with ours and then add your jar to the bin/ directory and be done.


I think that ought to work fine for Docker - extend the core image and add 
in your custom jar, then start and it'll see the built-in Server jar plus 
the custom one


Nick


Re: Validate MIME-type

2022-09-29 Thread Nick Burch

On Thu, 29 Sep 2022, Peter Conrad wrote:

thanks. That's definitely an improvement. But I think it's not
sufficient.

AFAICS your code uses "aliases" as in "if it's type X then it can also
be type Y". However, there are also cases where a specific instance of
type X can also be type Y, but not all instances of type X. For example,
the eicar.com antivirus test file is an MS-DOS executable consisting
purely of ASCII characters, so it would be valid text/plain AND
application/x-msdownload, but clearly not all text/plain files are valid
application/x-msdownload files, nor vice versa, so there can't be an
alias connecting the two.


Any chance you could write up a bit more about what you're trying to 
achieve, and what you're trying to protect against?


It's ApacheCon next week, and we may be able to get a few of us together 
in-person to brainstorm what's possible in this area


Thanks
Nick


Re: Tika documentation?

2022-09-01 Thread Nick Burch

On Thu, 1 Sep 2022, Mark Kerzner SHMsoft, Inc. wrote:

Yes, please. If I make some changes, I will start with small ones. I will
also verify them with you.


Great, thanks in advance for your contributions!

Can you please head to https://cwiki.apache.org/confluence/display/tika/ , 
click Sign Up in the top right, then let us know the username of your 
account once created? We can then issue you permissions


Thanks
Nick


Re: Datasets for testing large number of attachments

2022-07-26 Thread Nick Burch

On Mon, 25 Jul 2022, Oscar Rieken Jr via user wrote:
I am currently trying to validate our Tika setup and was looking for a 
set of example data I could use


If you want a small number of files of lots of different types, the test 
files in the Tika source tree will work. Main set are in

tika-parsers/src/test/resources/test-documents/

If you want a very large number of files, then the Tika Corpora collection 
is a good source. We have a few different collections, including stuff 
from common crawl, govdocs and bug trackers. If you can let us know what 
sort of file types and how many, we can suggest the best corpora 
collection


Nick


Re: Custom filter

2022-06-03 Thread Nick Burch

On Fri, 3 Jun 2022, Cihad Guzel wrote:
I want to pass the content's words through some filters while parsing in 
Tika. How can I add custom filtering?


Does the content handler work for this? Is there a document about this?


A custom content handler is a pretty good way to do that. Tika just uses 
regular Java XML content handlers, so you don't need a Tika-specific 
tutorial on writing one


Depending on what you're wanting to do, you can use Tika's 
TeeContentHandler to send the events to both your custom handler and a 
normal one. ContentHandlerDecorator can also be used to override just some 
bits
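
As a sketch of that shape - the decorator sits between the parser's SAX 
events and the text handler, and the filtering logic goes in the override:

import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.ContentHandlerDecorator;
import org.xml.sax.SAXException;

public class FilteringExample {
    public static String parseFiltered(InputStream stream) throws Exception {
        BodyContentHandler text = new BodyContentHandler(-1);
        // Transform or drop character events here as your "filter"
        ContentHandlerDecorator filter = new ContentHandlerDecorator(text) {
            @Override
            public void characters(char[] ch, int start, int length)
                    throws SAXException {
                String s = new String(ch, start, length).toUpperCase(); // example
                super.characters(s.toCharArray(), 0, s.length());
            }
        };
        new AutoDetectParser().parse(stream, filter, new Metadata(),
                new ParseContext());
        return text.toString();
    }
}

If you also want the unfiltered text, TeeContentHandler lets the same 
parse feed both handlers at once.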


Nick


Re: ForkParser issues with 2.3.0

2022-04-26 Thread Nick Burch

On Tue, 26 Apr 2022, Stephen H wrote:

On 26/04/2022 12:22, Nick Burch wrote:
Are you able to write a short junit unit test case which shows this issue? 
We have a bunch of small test OOXML and ODF files that could be used


I've done this - if I create an issue in Jira with it, would that be best?


Yup!

There isn't currently an ODS file in the parsers test set with a title, 
though there is one with a creator, which is enough to show the issue. I 
also couldn't see an MP4 video file that had any metadata in it.


If you have / could create a tiny MP4 file with some metadata, that'd be 
great!


Not sure if it's enough to trigger the OpenDocument bug, but there's a 
title in this test file:

tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-miscoffice-module/src/test/resources/test-document/testOpenOffice2.odf

Nick


Re: ForkParser issues with 2.3.0

2022-04-26 Thread Nick Burch

On Tue, 26 Apr 2022, Stephen H wrote:
Second, there seems to be some work missing in the handling of metadata 
from certain parsers when using ForkParser. For example, for 
OpenDocument ODP and ODS files and Microsoft Open XML formats, while the 
document text is returned there is no metadata in either the returned 
Metadata object or in the returned HTML head. The OpenDocument ODT 
format works as expected via ForkParser though.


Are you able to write a short junit unit test case which shows this issue? 
We have a bunch of small test OOXML and ODF files that could be used


Nick


Re: Returning file extension alongside mime-type?

2022-03-11 Thread Nick Burch

On Tue, 8 Mar 2022, Willy T. Koch wrote:

That’s fantastic, thank you!

Looking forward to testing when the Tika Docker repo is updated with 
this release.


That may take a few weeks, but if you don't mind building Tika from 
source, you should be able to give it a whirl now. (As far as I'm aware, 
we don't build the docker image from snapshots)


If you checkout Tika from source - 
https://tika.apache.org/contribute.html#Source_Code - and build the 
project with maven, you should then be able to go to the tika-docker/full 
directory and build the docker image locally


Nick



Re: Returning file extension alongside mime-type?

2022-03-07 Thread Nick Burch

On Fri, 18 Feb 2022, Willy T. Koch wrote:

Den Tor 17 feb 2022, kl. 20:00, skrev Nick Burch:

Tika devs - any thoughts on this? It's a pretty small code change (we
already have the data on the mime type!), just need feedback on extending
the existing API vs adding a new one


By also returning the default/most commonly used file extension, Apache 
Tika in Docker will be the perfect security companion for SaaS 
solutions.


To be able to verify all files before they are archived will prevent 
different errors down the line, like with PDF conversion and document 
production.


OK, this is now implemented. Should be in 2.3.1 or 2.4, whatever the next 
release is.


You will need to make an additional request to 
/mime-types/{type}/{subtype} eg /mime-types/application/cbor to get the 
full details on the type. You ought to be able to cache that though in 
case it helps.


See https://issues.apache.org/jira/browse/TIKA-3694 for a bit more detail 
and the example JSON you'll get


Nick


Re: Returning file extension alongside mime-type?

2022-02-24 Thread Nick Burch

On Thu, 24 Feb 2022, Tim Allison wrote:

A separate endpoint, then?  That would be cleaner.


We already have some mime details related endpoints; this would be an 
extension or related endpoint to those, see the earlier thread:

https://lists.apache.org/thread/jlym8ypnrj978hmzjgvkc1fpxnc7g51h

Nick


Re: Returning file extension alongside mime-type?

2022-02-24 Thread Nick Burch

On Tue, 22 Feb 2022, Tim Allison wrote:
I guess the question is how far do we want to bake this in?  I could see 
adding a field for the default extension in the 
CompositeDetector/DefaultDetector.  This would then be triggered on 
embedded files, too.  I can't imagine this would add much cost 
computationally(???), and it would just show up for free all over the 
place.


Ah, I thought this would be something that required two API hits. Having 
done your detection or parsing, you'd then query a mimetype related API to 
get extra details on the type you were told your file was.


You could also pre-check types you think you'd be interested in, or grab 
all the details on all the types, if you so wanted


Nick


Re: Returning file extension alongside mime-type?

2022-02-17 Thread Nick Burch

On Thu, 10 Feb 2022, Nick Burch wrote:

On Thu, 10 Feb 2022, Willy T. Koch wrote:

…and calling it as a webservice with Postman/curl.


Ah, I think we might not be exposing the full details of the mime types via 
the server, only details of their parsers and the hierarchy, eg

http://localhost:9998/mime-types#audio/vorbis

(We have that info in Java we're just seemingly not making it available)


I'm not sure about exposing all the details of all the types by default, 
but adding a flag and/or a sub-endpoint that would return the full 
details of a type, including extensions and comments etc, seems OK to 
me. Thoughts anyone?


Tika devs - any thoughts on this? It's a pretty small code change (we 
already have the data on the mime type!), just need feedback on extending 
the existing API vs adding a new one


Nick

Re: Returning file extension alongside mime-type?

2022-02-10 Thread Nick Burch

On Thu, 10 Feb 2022, Willy T. Koch wrote:

…and calling it as a webservice with Postman/curl.


Ah, I think we might not be exposing the full details of the mime types 
via the server, only details of their parsers and the hierarchy, eg

http://localhost:9998/mime-types#audio/vorbis

(We have that info in Java we're just seemingly not making it available)


I'm not sure about exposing all the details of all the types by default, 
but adding a flag and/or a sub-endpoint that would return the full details 
of a type, including extensions and comments etc, seems OK to me. Thoughts 
anyone?


Nick

Re: Returning file extension alongside mime-type?

2022-02-10 Thread Nick Burch

On Thu, 10 Feb 2022, Willy T. Koch wrote:
As for content detection, today the content-type field with mime type is 
returned. What we would need is a mime-type to file extension lookup and 
it seems logical that this was also returned by Tika.


How are you calling Tika? We already have APIs for this. Just ask the 
MimeTypes class (available via TikaConfig.getMimeRepository) about a type, 
and it'll return the details including the preferred extension and other 
possible well-known extensions
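
eg a minimal sketch:

import org.apache.tika.config.TikaConfig;
import org.apache.tika.mime.MimeType;
import org.apache.tika.mime.MimeTypes;

public class ExtensionLookup {
    public static void main(String[] args) throws Exception {
        MimeTypes types = TikaConfig.getDefaultConfig().getMimeRepository();
        MimeType pdf = types.forName("application/pdf");
        System.out.println(pdf.getExtension());   // preferred, e.g. ".pdf"
        System.out.println(pdf.getExtensions());  // all known extensions
    }
}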


Nick


Re: Tika 2.1.0 pdf parser

2021-10-21 Thread Nick Burch

On Thu, 21 Oct 2021, nskarthik wrote:
Question :  Need to extract Text / images at page level using java. 
Did not find any example on www or Tika website.


For PDF, you should fetch the contents as XHTML rather than plain text. 
You can then split on the page divs. This isn't available for formats 
which aren't page-based, but luckily PDF is


Depending on what you want to do, it might make sense to write a custom 
ContentHandler which works a lot like the ToTextContentHandler in Tika, 
but which starts writing to a new text buffer each time it hits the event 
for a new page
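
A sketch of that kind of handler - the PDF parser emits one 
<div class="page"> per page in the XHTML:

import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class PageTextHandler extends DefaultHandler {
    private final List<StringBuilder> pages = new ArrayList<>();
    private StringBuilder current;

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        // Start a fresh buffer at each page div
        if ("div".equals(localName) && "page".equals(atts.getValue("class"))) {
            current = new StringBuilder();
            pages.add(current);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    public List<StringBuilder> getPages() {
        return pages;
    }
}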


Nick


Re: Deleted text in Word document

2021-08-27 Thread Nick Burch

On Fri, 27 Aug 2021, Peter Kronenberg wrote:
When Tika extracts from a Microsoft Word document, deleted text is 
extracted, with no indication that it is deleted.  In fact, if a word 
was deleted and replaced by another word, both words just show up 
side-by-side.  Is there a way to get some sort of annotation that 
indicates the status of the text?  Or extract it in some sort of 
structured (e.g., XML) format?


How are you calling Tika? Is the XHTML output sufficiently marked-up to 
let you spot it?


Nick


Re: dcterms:created date changes on RTF documents

2021-07-22 Thread Nick Burch

On Thu, 22 Jul 2021, David Pilato wrote:

TL;DR: the created date of the document changes depending on the timezone.


That does seem a bug


For example:

• Asia/Sakhalin gives dcterms:created=2016-07-06T23:38:00Z
• Asia/Colombo gives dcterms:created=2016-07-07T05:08:00Z
• Europe/Stockholm gives dcterms:created=2016-07-07T08:38:00Z


As a general rule, if we know the timezone, we should be returning it, or 
taking account of it. If the file format doesn't store the timezone, we 
should be returning a datetime without any timezone specified


I don't know if it's a bug or expected. Maybe the RTF format does not 
specify the Timezone.


If there's no timezone in the format, there shouldn't be a timezone (eg Z 
for UTC) in the output


Any chance you could report a bug in JIRA, and upload a small sample file 
showing the problem and a small unit test demonstrating it?


Thanks
Nick

Re: logging formatter configuration compatible with StackDriver

2021-06-11 Thread Nick Burch

On Fri, 11 Jun 2021, Cristian Zamfir wrote:

I think for most people it would be quite critical to have logs working. Do
you happen to know how I can reach out to the person maintaining the docker
images https://hub.docker.com/u/dameikle to see if they are available to
update the images? Sounds like it is mostly
https://hub.docker.com/u/dameikle


Paging our very own Dave Meikle!

Nick


Re: --header "X-Tika-OCR: false" ; an option to fully disable OCR for each request

2021-06-10 Thread Nick Burch

On Thu, 10 Jun 2021, Cristian Zamfir wrote:
Got it, thanks. What are your thoughts on using Tika 2.x while still in 
beta? Is it likely to be more stable than 1.26? I presume it has passed 
the same extensive test suite.


Usage stability wise, it's as good as 1.x.

API stability wise things are still changing, based on user feedback, but 
I think we're almost at the point of freezing everything for 2.0 final.


Nick


Re: --header "X-Tika-OCR: false" ; an option to fully disable OCR for each request

2021-06-10 Thread Nick Burch

On Thu, 10 Jun 2021, Cristian Zamfir wrote:
Thanks Nick. Looks like the option I was looking for is the 3rd one, but 
the docs say it is only available in Tika 2.x - am I right?


I've just done a grep of the codebase, and it isn't in the 1.x branch, 
only main = 2.x. So, Tika 2.x only


Nick


Re: --header "X-Tika-OCR: false" ; an option to fully disable OCR for each request

2021-06-10 Thread Nick Burch

On Thu, 10 Jun 2021, Cristian Zamfir wrote:
It would be nice if this was feasible via the headers of each request. I 
find it more convenient to use if/else in my code than in the yaml files 
used for k8s configuration. Is there such an option?


Three options, see 
https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-DisableOCRinTikadisable-ocr

 * Don't install tesseract on the machine hosting Tika
 * Supply a Tika Config file that disables the Tesseract parser
 * Send the Server the custom header X-Tika-OCRskipOcr: true
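
For the third option, that's something like this, assuming the server on 
its default port:

curl -T scanned.pdf --header "X-Tika-OCRskipOcr: true" http://localhost:9998/tika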

Nick


Re: best practices for avoiding OOM for tika docker

2021-06-02 Thread Nick Burch

On Wed, 2 Jun 2021, Cristian Zamfir wrote:

1. Do you have a recommendation for a stress test that would allow me to
easily test OOM behavior?


Depends what kind of OOM you're interested in. If you fire a lot of 
memory-hungry documents at a single server at once, you can trigger an 
OOM. Alternately, if you send Tika one broken or nearly-broken document, 
you can trigger an OOM. We try to fix cases of the latter when we can, but 
they still occur; one of those is the easiest, if you find one



2. For implementing a health check that detects when Tika is stuck, I could
periodically send a simple request and check that the reply is correct, do
you recommend a better approach?


Would /version work for you? Server needs to be up to reply, but it's a 
very lightweight endpoint


Nick


Re: best practices for avoiding OOM for tika docker

2021-05-28 Thread Nick Burch

On Thu, 27 May 2021, Cristian Zamfir wrote:

I am running some stress tests of the latest tika server docker (not
modified in any way, just pulled from the registry) and seeing that after a
few hours I see OOM in the logs. The container has a limit of 4GB set in
K8S. I am wondering if you have any best practices on how to avoid this.


Hopefully one of our Tika+Docker experts will be along in a minute to help 
advise!


For now, the general advice is documented at:
https://cwiki.apache.org/confluence/display/TIKA/The+Robustness+of+Apache+Tika

Also, which version of Tika are you on? There have been some contributions 
recently around monitoring the server, which you might want to upgrade 
for, eg TIKA-3353


Nick


Re: Tika Docker licence

2021-04-17 Thread Nick Burch

On Sat, 17 Apr 2021, Lewis John McGibbney wrote:

Please point me to the code for the ‘ttf-mscorefonts-installer’.


The bit of the Tika docker file that pulls them in is:
https://github.com/apache/tika-docker/blob/master/full/Dockerfile#L21

I think the EULA (which we auto-accept during installation) is
http://corefonts.sourceforge.net/eula.htm

They're certainly not Open Source or Free. Assuming that's the right 
license though, there's no restrictions on use, nothing anti-commercial or 
anything like that


Nick

Re: Tika Docker licence

2021-04-16 Thread Nick Burch

On Tue, 13 Apr 2021, Subhajit Das wrote:
The Tika Docker image (full) uses ‘ttf-mscorefonts-installer’. The 
licence used by it is a Microsoft licence and doesn’t seem to allow 
commercial use.


Can anyone please confirm if it is ok to use? Or should a customized 
version be used for production?


Licensing of docker images can be complex... There's the licenses of each 
image layer's dockerfile, the licenses of the things those image layers 
pull in, and possibly a license for the resulting image.


Depending on if you publish a certain layer, or just use it locally, the 
distribution clause in a lot of licenses may or may not get triggered. Not 
all docker image hosting services fully comply with all license terms, eg 
providing the source for hosted GPL binaries. It's complicated, and fairly 
easy to end up in hot water if you don't do your due diligence!



If you have very specific needs, I would suggest finding a base image you 
are happy with license-wise, then grab just the Tika components you want 
on top of that. Use our dockerfile as a guide of how to install and run 
Tika.



Apache Tika itself, and all required dependencies are available under the 
Apache License v2 or similar, see 
https://www.apache.org/legal/resolved.html for the general policy we work 
to. Some of the command line tools we can call out to, and things they 
use, may be under other licenses (especially copyleft ones), but those are 
all optional.


Nick

RE: UNSUBSCRIBE

2021-04-16 Thread Nick Burch

On Fri, 16 Apr 2021, Maloney, Patrick (ITS) wrote:
Thanks, but that info is not in the individual e-mails...I checked for 
that.


Hmm, that might be an issue with your email client. Every list message has 
this in the headers


Mailing-List: contact user-h...@tika.apache.org; run by ezmlm
Precedence: bulk
List-Help: 
List-Unsubscribe: 
List-Post: 
List-Id: 
Reply-To: user@tika.apache.org
Delivered-To: mailing list user@tika.apache.org

Most mail clients (but sadly not all!) show that and offer an easy click 
to do the unsubscribe


Nick


Re: UNSUBSCRIBE

2021-04-16 Thread Nick Burch

On Fri, 16 Apr 2021, Maloney, Patrick (ITS) wrote:

UNSUBSCRIBE


To unsubscribe from the Apache Tika users list, send an email to
user-unsubscr...@tika.apache.org and then reply to confirm. This info is 
also included in every email


Nick


RE: Parsing PDF file - setting threshold of unmapped characters

2021-04-14 Thread Nick Burch

On Wed, 14 Apr 2021, Peter Kronenberg wrote:

Anyone have any thoughts on this?


I think both an absolute and a percentage would be good, but I don't have 
enough experience to comment on your suggested numbers for those two 
thresholds, sorry!


Your idea on best vs fast touches on much older discussions on what to do 
when we have multiple possible parsers available. For example, an external 
program that's slow but official and very reliable, or a java library 
that's quick but misses some edge cases. We never did manage to reach a 
conclusion on that though...


Nick



Subject: RE: Parsing PDF file - setting threshold of unmapped characters

I’ve been thinking about this and I think it would be a good idea to change the 
comparison of unmapped characters to a percentage.  For example, you suggested


unmappedUnicodeCharsPerPage > 10 && percentUnmappedUnicodeChars > 0.2 or 
something?



The percentage could be configurable.



Another thought I had to was to have an AUTO_BEST and AUTO_FAST.  AUTO_FAST 
would have a higher threshold of Unmapped Characters, so that in most cases, it 
would just extract text and not use OCR.  The performance overhead of OCR is 
very high for not a lot of benefit given that it extracts 99% of the text.

AUTO_BEST would have a lower threshold before OCR is triggered.



Or just keep AUTO and allow the threshold to be configured, either by number of 
characters or percentage.  The only downside to this is that the user would 
have to understand it a little more to be able to set the threshold properly, 
instead of AUTO just working magically



What do you think?


Peter Kronenberg  |  Senior AI Analytic ENGINEER
C: 703.887.5623
4303 W. 119th St., Leawood, KS 66209
WWW.TORCH.AI


From: Tim Allison <talli...@apache.org>
Sent: Monday, April 5, 2021 1:49 PM
To: Peter Kronenberg <peter.kronenb...@torch.ai>
Cc: user@tika.apache.org
Subject: Re: Parsing PDF file

Y. You understand perfectly!

I want "auto" to be the best it can be and most generally applicable across use 
cases.  For users who want high performance/better control, you might parse the PDF first 
with NO_OCR, and then make the determination on which pages to run OCR based on those 
statistics pulled out in the first parse.  Another key statistic in the decision would be 
the out of vocabulary measurement that you can get with an integration with tika-eval.

So, in short, if there are clear, provable, general improvements to AUTO, we 
should make them.  If you want more refined control, let us know if the current 
metadata can be improved to help you develop your application for your use 
cases.

On Mon, Apr 5, 2021 at 1:06 PM Peter Kronenberg <peter.kronenb...@torch.ai> wrote:
You’re right that OCRing would result in slightly more accurate results in this 
case.  But the performance penalty is high.  Wondering if there is some 
intermediate option.

I think I understand now why you are separately looking for unmapped characters 
as well as total characters.  If total characters is low, we assume the page is 
an image and OCR.  But if unmapped characters is high, it might still be 
straight text, but the unmapped characters will essentially result in 
unreadable characters

From: Tim Allison <talli...@apache.org>
Sent: Monday, April 5, 2021 11:39 AM
To: Peter Kronenberg <peter.kronenb...@torch.ai>
Cc: user@tika.apache.org
Subject: Re: Parsing PDF file

As for the metadata, we should add unique.  Given that multiple parsers can hit 
the same file, we need to record all of them (in this case: default, pdf, 
tesseract).

As for tweaking the settings...I'm not sure as I look at the extracted text more.  There are quite 
a few bad ligatures /unmapped unicode chars which would render search for, e.g. 
"efficient", "affairs" useless.

On Mon, Apr 5, 2021 at 10:40 AM Peter Kronenberg <peter.kronenb...@torch.ai> wrote:
Yes, I think tweaking the criteria for Auto is a good idea.
And if the parser list was a Set, that would automatically eliminate dups

From: Tim Allison <talli...@apache.org>
Sent: Monday, April 5, 2021 10:15 AM
To: user@tika.apache.org
Subject: Fwd: Parsing PDF file

It looks like the ligatures don't have unicode mappings:

"Division of Monetary A???airs"


if (totalCharsPerPage < 10 || unmappedUnicodeCharsPerPage > 10)

The issue is that this file has > 10 unmapped unicode chars per page.

We could change the heuristic to unmappedUnicodeCharsPerPage > 10 && 
percentUnmappedUnicodeChars > 0.2 or something?

We should also probably check to see if a parser is in the parsed by list 
before re-adding it?



0: pdf:charsPerPage : 1579
0: pdf:charsPerPage : 1891
0: pdf:charsPerPage : 2283
0: pdf:charsPerPage : 2224
0: pdf:charsPerPage : 1619
0: pdf:charsPerPage : 2177
0: pdf:charsPerPage : 1626
0: pdf:charsPerPage : 1313

Re: TikaServer Header Name is Case-sensitive

2021-03-15 Thread Nick Burch

On Mon, 15 Mar 2021, Subhajit Das wrote:
It seems that TikaServer 1.25 headers like “X-Tika-PDFOcrStrategy” are 
case sensitive.


Yes. That's because those then get mapped onto underlying Java classes and 
methods, which are case sensitive



According to 
https://stackoverflow.com/questions/5258977/are-http-headers-case-sensitive, 
header names should not be case sensitive.
Is there any way to configure this?


Nothing on the Tika side, we are case sensitive due to the mapping to 
underlying Java stuff. You'd need to do any configuration on your end to 
not mangle the headers, sorry


Nick

Re: Microsoft alternate fonts on RHEL

2021-03-06 Thread Nick Burch

On Sat, 6 Mar 2021, Subhajit Das wrote:
But, the fonts and packages are not available on RHEL, as those are 
Debian packages.


Please suggest alternate option to setup all supported fonts and 
packages on RHEL.


Without a RHEL support login I can't be sure if these help or not, but I'd 
suggest starting with

https://access.redhat.com/solutions/2605
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/desktop_migration_and_administration_guide/configure-fonts

There's also some advice at
https://unix.stackexchange.com/questions/217365/how-to-install-microsoft-true-type-fonts-for-centos-7

Nick


Re: Re-using a TikaStream

2021-03-01 Thread Nick Burch

On Mon, 1 Mar 2021, Tim Allison wrote:

detectors should return the stream reset to the beginning.


I agree - needs to be ready for the parser to then process


Parsers, IIRC, should return the stream fully(?) read but not closed.


Not always - if the parser wanted a File then it may not have touched the 
stream.


Equally, if the parser can't handle the file (eg it starts reading, finds a 
version number that indicates it isn't able to handle it and gives up), 
then the stream won't be ready


Nick


RE: Re-using a TikaStream

2021-03-01 Thread Nick Burch

On Fri, 26 Feb 2021, Peter Kronenberg wrote:
For most audio files, using the AudioParser, the buffer is still at the 
beginning.  Even though there is no text extraction, I would think that 
Tika still needs to read through the stream. The MP3Parser consumes the 
stream, but the MP4Parser does not


IIRC the MP4 parsing library we use needs a File not a Stream, so we have 
to spool everything to disk


The OCR parser also leaves the pointer at the beginning.  It definitely 
consumes the stream, so it must be resetting it.


OCR needs a file to call out to Tesseract with, so has to spool the stream 
to disk


So what is going on? And now I get back to my original question, which 
is: what is the best way to consistently be able to re-use the stream?


Force Tika to spool to disk is probably the only way to be sure, assuming 
you don't have enough memory to always buffer everything in ram


Nick


RE: Re-using a TikaStream

2021-02-23 Thread Nick Burch

On Tue, 23 Feb 2021, Peter Kronenberg wrote:
I was re-reading some emails with Nick Burch back around Dec 22-23 and 
maybe I misunderstood him, but it sounds like he was saying that 
TikaInputStream was smart enough to automatically spool the stream to 
disk to allow re-use.


If a parser knows it is going to need to have a File, or knows it will 
need to re-read multiple times, it can tell TikaInputStream which will 
save to a temp file. If you as the caller know this, you can force it with 
a getFile / getPath call


If spooling to a local file is expensive, but restarting the stream 
reading is cheap, then the InputStreamFactory can be used instead. 
Typically that's with cloud storage or the like
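
A sketch of forcing the spool:

import java.io.InputStream;
import java.nio.file.Path;
import org.apache.tika.io.TikaInputStream;

public class SpoolExample {
    public static void process(InputStream raw) throws Exception {
        try (TikaInputStream tis = TikaInputStream.get(raw)) {
            // Forces a spool to a temp file (if not file-backed already);
            // afterwards the stream can be re-read from the start as needed
            Path path = tis.getPath();
            // ... detect / parse using tis, or hand "path" to file-based code
        }
    }
}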


Nick


Re: Error calling ImageMagick

2021-02-12 Thread Nick Burch

On Thu, 11 Feb 2021, Tim Allison wrote:

I can replicate this on my windows laptop.

The weird thing is that the image file is actually there and if I pause the
debugger at the point after imagemagick has complained that the file isn't
there but before Tika does the clean up,


Windows is funny about two programs having the same file open, especially 
if one of them has it for read-write. It does mean that Windows has file 
semantics that better match what people expect (opening a file on unix 
then deleting it, but still being able to keep reading the old file 
confuses developers the first time they discover it!), but also means you 
have to be more careful about closing files, simultaneous access etc


Nick


Re: OCR on PDFs

2020-12-31 Thread Nick Burch

On Thu, 31 Dec 2020, Peter Kronenberg wrote:
I've got Tika working with Tesseract on PDF files, but it seems that if 
I give it a PDF file that has both searchable text and images, the text 
is OCRed twice.


Is this a PDF where some other tool has already done the OCR and stored 
the text it found behind the image?


If you highlight the image in Acrobat Reader, does it manage to select 
some text? If you copy and paste do you get text out?


Does this PDF have a mixture of "normal" text and images containing text, 
or is it all just "image text"?



Answers to these will affect how much Tika can help / be configured!

Thanks
Nick


Re: Metadata

2020-12-29 Thread Nick Burch

On Mon, 28 Dec 2020, Peter Kronenberg wrote:
For the metadata that comes back from a parse (example below), clearly, 
the fields are dependent on the file type and information available. 
Are there any 'standard' fields that come back for all/any files?  Such 
as Author, date, x-parsed-by, etc.  Is there a list of these somewhere?


Main ones are taken from Dublin Core, see:
http://tika.apache.org/1.25/api/org/apache/tika/metadata/DublinCore.html

Other ones that a fair number use come from:
http://tika.apache.org/1.25/api/org/apache/tika/metadata/TikaMetadataKeys.html
http://tika.apache.org/1.25/api/org/apache/tika/metadata/HttpHeaders.html

The full set of properties is defined in the interfaces at:
http://tika.apache.org/1.25/api/org/apache/tika/metadata/package-summary.html

Nick


RE: Mimetypes

2020-12-23 Thread Nick Burch

On Wed, 23 Dec 2020, Peter Kronenberg wrote:
Best is to wrap as a TikaInputStream, detect using all the detectors 
via DefaultDetector, then parse after that.


But sometimes the detect will read the whole file, right?  For example, 
for Word.  So is it then making 2 passes?


Nope, we stash the open container ready for re-use by the parser
https://tika.apache.org/1.24.1/api/org/apache/tika/io/TikaInputStream.html#getOpenContainer--


IIRC it does if you use AutoDetectParser but not always otherwise


Oh, ok, you’re right.  It’s listed as Content-Type.  I was searching for 
Mime-type. 


Yes, that's the standard http header for it, and we try to re-use existing 
definitions where possible!


Nick

RE: Mimetypes

2020-12-23 Thread Nick Burch

On Wed, 23 Dec 2020, Peter Kronenberg wrote:
But yet, if I understand correctly, using a TikaInputStream *will* spool 
the entire stream to disk so it can read everything, right?  If I 
re-read the stream to parse, is it making 2 passes?


TikaInputStream has logic in it to dump the stream to a temp file so it can 
be re-read multiple times as required. It only does that dump if required 
though; for formats that don't need it, it just acts as a buffering / mark 
+ reset Stream


In my use case, we will not have any filename or metadata.  It will just 
be a stream.  But you're right in that we will want to parse it.  So it 
sounds like the best way to do it is to do the detect on the first few 
bytes, which will at least give you an idea of what it is, but not 
precise. (Should this be a TikaStream?)  And then do the parse.


Best is to wrap as a TikaInputStream, detect using all the detectors via 
DefaultDetector, then parse after that.
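
A sketch of that flow:

import java.io.InputStream;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class DetectThenParse {
    public static MediaType detectAndParse(InputStream raw) throws Exception {
        // TikaConfig's detector is the composite of all available detectors
        Detector detector = TikaConfig.getDefaultConfig().getDetector();
        try (TikaInputStream tis = TikaInputStream.get(raw)) {
            Metadata metadata = new Metadata();
            MediaType type = detector.detect(tis, metadata); // stream reset after
            BodyContentHandler handler = new BodyContentHandler(-1);
            new AutoDetectParser().parse(tis, handler, metadata,
                    new ParseContext());
            return type;
        }
    }
}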


I'm still surprised, however, that the mimetype doesn't seem to appear 
on the Metadata after parsing.


IIRC it does if you use AutoDetectParser but not always otherwise, but I'm 
not certain on that...


Nick


RE: Mimetypes

2020-12-23 Thread Nick Burch

On Tue, 22 Dec 2020, Peter Kronenberg wrote:

Oh, so reading the stream doesn't read the whole file?


Not for Detect, no. The assumption is that Detect is normally followed by 
Parse, so you won't want the Stream consuming, so we do a mark/reset to 
check the first few kb only


I know for Office files you can tell it's an Office file from the first 
dozen or so bytes, but you have to read the 2nd 512 block to find out 
more.


Not always... Many tools opt to put the properties blocks very close to 
the start, which lets you tell the type (because you can see the entry 
names), but not all do. For the rest, you need to open the OLE2 structure 
and check the names of the entries


Nick


Re: Mimetypes

2020-12-22 Thread Nick Burch

On Tue, 22 Dec 2020, Peter Kronenberg wrote:

I'm trying to detect the mimetype of a file using both

Tika.detect(InputStream)
and
Tika.detect(File)

I get 2 different results.  I'm testing with a Microsoft Word (.doc) file.


The InputStream one is based on just the first few kb of the file. That's 
enough to figure out it's an OLE2 file, but not what flavour


The File one reads the whole file, checks the OLE2 directory entries, and 
identifies that you have a Word file



If you gave Tika the InputStream + filename on a Metadata object, it would 
specialise the OLE2 type to Word based on the extension


If you gave Tika a TikaInputStream, it would detect that a File was needed 
for a fully precise answer, spool the Stream to a File, then use that to 
detect (and later parse if you need)


Nick


Re: Extract URLs from a document

2020-11-12 Thread Nick Burch

On Wed, 11 Nov 2020, nensick wrote:
I am exploring the available features and I managed also to extract 
Office macros but I still don't find a way to get the links.


Imagine you have a PDF or a DOCX in which you have a "click here" text as a 
link pointing
to a website (let's say example[.]com). How can I get example[.]com?


If you were calling the Java directly, it would be fairly easy - just 
provide your own content handler that only captures the <a> tags and 
records the href attributes of those. You can use the Tee content handler 
to have a normal text-extraction handler called as well as your 
link-capturing one
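
A sketch of such a handler:

import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class LinkCollector extends DefaultHandler {
    private final List<String> links = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        // Tika's XHTML output puts the target in the href attribute
        if ("a".equals(localName)) {
            String href = atts.getValue("href");
            if (href != null) {
                links.add(href);
            }
        }
    }

    public List<String> getLinks() {
        return links;
    }
}

Wrap it in a TeeContentHandler together with a BodyContentHandler and you 
get the plain text and the links in a single parse.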


From the Tika Server, it's not quite so simple. I'd probably just say ask 
the Tika Server for the xhtml version of your document (instead of the 
plain text one), then use the xml parsing in your calling language to grab 
the links from the a tags. Depending on your needs, either call the Tika 
Server twice, once for xhtml to get tags and once for plain text, or just 
once for xhtml and process the results twice


Nick


Re: WARNING: org.xerial's sqlite-jdbc is not loaded for 1.2.4

2020-04-22 Thread Nick Burch

On Wed, 22 Apr 2020, Tim Allison wrote:

Y. Agreed. Where should we document this? Where would you look for it?


The Tika Server and Tika App both get a fair bit of use from non-Java devs

Maybe we need a quickstart for non-Java folks section, and probably a 
python-specific one as we get loads of queries from newbies on that front 
who we want to help!


Nick


Re: WARNING: org.xerial's sqlite-jdbc is not loaded for 1.2.4

2020-04-21 Thread Nick Burch

On Mon, 20 Apr 2020, Bradley Beach wrote:

I have tried every permutation of adding sqlite-jdbc-3.30.1.jar to my
classpath but still get:
 
java -classpath ".:sqlite-jdbc-3.30.1.jar" -jar tika-server-1.24.jar
--host=localhost --port=12345


You can't combine -classpath and -jar, you have to use one or the other. 
If you give -jar then -classpath is ignored...


Nick

Re: Setting PDF2XHTML img src

2020-01-03 Thread Nick Burch

On Fri, 3 Jan 2020, Mike Dalrymple wrote:

I've just started using Tika to process PDFs with embedded images.  I'm
getting fantastic results but I'm having to post-process the generated
XHTML to correct the value of the src attribute on the img elements.


That is expected. A simple sax handler should let you do that, re-writing 
it to point to where you're saving the images



The generated XHTML has elements like:




The embedded prefix is Tika's way of letting you know there was an 
embedded image there, and what name it would have if you extracted it 
(which you may not have done).


The idea is that, for the extract+display case, you re-write it to match 
where you stored the image. For other cases, you know it was an embedded 
image rather than an external reference


Any direction would be greatly appreciated.  I'm currently just passing 
the generated XHTML through a regex that converts the src attributes and 
that works fine, it just feels like there may be a more idiomatic way 
that I'm not seeing.


Several jobs ago, I wrote some code to do this for Alfresco:
https://github.com/alfresco-mirror/alfresco-mirror/blob/b3d815063d3634d4bde83b4a214db62215a490fd/root/projects/repository/source/java/org/alfresco/repo/rendition/executer/HTMLRenderingEngine.java#L490

The idea is it re-writes just the embedded image links to point to a 
specific folder path or prefix where the embedded images were written, 
while leaving all other (external) images alone
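
In current Tika terms, a more idiomatic version of the regex approach 
would be a ContentHandlerDecorator along these lines - a sketch, with the 
output handler and path prefix left up to you:

import org.apache.tika.sax.ContentHandlerDecorator;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

public class EmbeddedImageRewriter extends ContentHandlerDecorator {
    private final String prefix; // e.g. "images/" - wherever the images went

    public EmbeddedImageRewriter(ContentHandler output, String prefix) {
        super(output);
        this.prefix = prefix;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        String src = atts.getValue("src");
        // Only re-write embedded: links, leave external references alone
        if ("img".equals(localName) && src != null
                && src.startsWith("embedded:")) {
            AttributesImpl fixed = new AttributesImpl(atts);
            fixed.setValue(fixed.getIndex("src"),
                    prefix + src.substring("embedded:".length()));
            super.startElement(uri, localName, qName, fixed);
        } else {
            super.startElement(uri, localName, qName, atts);
        }
    }
}

Pass it wrapped around a ToXMLContentHandler to parse(), and only the 
embedded image links get re-written.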


Nick


Re: Encoding detectors in OSGi (tika-bundle)

2019-11-12 Thread Nick Burch

On Tue, 12 Nov 2019, Katsuya Tomioka wrote:
I'm having trouble accessing encoding detectors in OSGi with Tika 1.22. 
AutoDetectParser returns "Failed to detect the character encoding of a 
document" for non-Latin text. We are migrating from 1.10, I'm sure many 
things are different. It seems like my problem is while all the 
detectors are in tika-parser, the code is loading from tika-core's. I 
see parsers and detectors are tracked as services. Do I need to do 
something similar to load encoding detectors as well?


The things which are currently loaded via services are:
 * Parsers
 * Detectors (file type)
 * Translators
 * Encoding Detection
 * Language Detection
 * Probability-based type detectors

I think there might be helpers to assist with those, hopefully one of our 
OSGi experts will be along shortly to advise!


Nick


Re: Anyone have a nice Unix service script for running Tika Server?

2019-10-16 Thread Nick Burch

On Wed, 16 Oct 2019, Eric Pugh wrote:
I’m looking at running Tika Server mode in a Linux box (and sorry, I 
don’t know the specific flavour….).  Is there a nice service script to 
deal with bring Tika back up if the Linux box is restarted?


Are you using a systemd-based linux, or a different one, eg the older 
sysv init?


If using systemd, it's pretty easy to write a service file which will 
start the server on boot, and restart it if it fails. ExecStart generally 
needs full paths for java stuff, but otherwise is fairly easy, especially 
when following a systemd tutorial for your chosen flavour!
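
A sketch of such a unit file - the paths are assumptions, adjust for your 
install:

[Unit]
Description=Apache Tika Server
After=network.target

[Service]
# Full path to java, as noted above
ExecStart=/usr/bin/java -jar /opt/tika/tika-server.jar --host 0.0.0.0 --port 9998
Restart=on-failure

[Install]
WantedBy=multi-user.target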


Nick

Re: Sample Rate / Audio Sample Rate not included in XML output

2018-10-17 Thread Nick Burch

On Wed, 17 Oct 2018, Tim Allison wrote:

This is one of the limitations of a streaming write.  As I look at
the code of the MP3Parser, I _think_ it would be trivial to write the
metadata before writing any content, and it wouldn't get in the way of
a streaming parse because the parser reads the whole file and caches
the content as it goes -- only writing once it has finished reading
the file.


IIRC some of the metadata is only known once all parsing is finished, eg 
the audio duration, which may be why it's currently done as it is


Nick


Re: Google Takeout GChat messages

2018-09-05 Thread Nick Burch

On Tue, 4 Sep 2018, Tucker Barbour wrote:
I've exported a GMail archive in MBOX format using takeout.google.com. The 
MBOX archive also includes GChat messages. However, the GChat messages do not 
include a Date header. Instead the date sent is included in what appears to 
be a non-conforming RFC822 header which the tika mbox parser does not 
recognize.


As a user of Tika, were you expecting these to show up as additional 
emails in the mbox, or something else?


(The underlying library may not give us a choice, I haven't dug in enough 
recently to remember, but in case it does, user expectations are of 
interest!)


I'm wondering if anyone has any experience extracting metadata from 
Gmail exports, specifically gchat messages. Any help or guidance would 
be appreciated.


Any chance you could share / produce a small mbox file, with a handful of 
both real emails and these gchat messages in, so we can take a look? If 
you could open a bug in jira, and attach the small mbox file, that'd be 
great


Nick


Re: Forcing Parser Invocation

2018-04-24 Thread Nick Burch

On Mon, 23 Apr 2018, lewis john mcgibbney wrote:

Using the tika-server, I am having issues parsing the attachment ENVI hdr
file at [0] with the EnviHeaderParser [1].

Is there any way I can explicitly force execution of the EnviHeaderParser?


I think not directly on a per-request basis. All the Tika Server endpoints 
go through the createParser() method of TikaResource which gets a new 
AutoDetect parser. Short of a Tika Config only containing the ENVI parser, 
I don't think you can directly force it with a header.


Your only option really is to get Tika to detect your file as an ENVI one. 
That means defining the envi type in the Tika Mimetypes file (it doesn't 
seem to be there...), then using mime magic or an explicit type header to 
get it detected as ENVI. At that point the ENVI parser should kick in
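
As a sketch, that definition would go in a custom-mimetypes.xml at 
org/apache/tika/mime/custom-mimetypes.xml on the classpath. The type name 
here is an assumption - it needs to match what EnviHeaderParser claims - 
but ENVI header files do start with the literal word ENVI, which gives you 
the magic:

<?xml version="1.0" encoding="UTF-8"?>
<mime-info>
  <mime-type type="application/envi.hdr">
    <glob pattern="*.hdr"/>
    <magic priority="50">
      <match value="ENVI" type="string" offset="0"/>
    </magic>
  </mime-type>
</mime-info>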


Nick


Re: Tika Parsers jar?

2018-04-19 Thread Nick Burch

On Thu, 19 Apr 2018, AJ Weber wrote:
But I can't find that jar anywhere in any of the download areas.  (I 
don't know why, but my maven isn't working properly.)


You need to use Maven / Gradle / Ivy to fetch it, and everything it 
depends on


Can someone point me to the location of such a jar and a list of 
dependencies, in case I can't get maven working and have to d/l them the 
old way?


The list is absolutely huge, so no!

As a quick-and-dirty fix, you can just grab the Tika App jar and use that; 
it has the Tika Parsers and all their dependencies inlined in it. Longer 
term, you should get a working build tool like Maven or Gradle


Nick

Re: Hex of RSS xml file is not recognized as RSS file MIME type

2018-04-19 Thread Nick Burch

On Wed, 18 Apr 2018, Jean-Nicolas Boulay Desjardins wrote:

I converted this RSS XML content to hex:




Then send it to Tika... Tika returns: text/plain


Base 64 encoded XML is no longer valid XML, so this is as expected.


Why am I not getting the rss mime type?


You need to send Tika the real file as-is

Nick


Re: Subfile Extraction

2018-03-27 Thread Nick Burch

On Sun, 25 Mar 2018, McGreevy, Anthony wrote:
I am currently playing with Tika to see how it works with regards to 
extraction of subfiles.


Do you mean files or resources embedded within another file?

If so... With the Tika App, you want -z to have these extracted. With the 
Tika java classes, you want to pop something like a 
https://tika.apache.org/1.17/api/org/apache/tika/parser/RecursiveParserWrapper.html

or a
https://tika.apache.org/1.17/api/org/apache/tika/extractor/ContainerExtractor.html
on your ParseContext to get called for embedded resources. See
https://wiki.apache.org/tika/RecursiveMetadata for more on how it works 
and how to have Tika parse + return all the embedded files and resources
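
A sketch against the 1.x API linked above (this API changed in later 
releases), returning one Metadata object per embedded document:

import java.io.InputStream;
import java.util.List;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.RecursiveParserWrapper;
import org.apache.tika.sax.BasicContentHandlerFactory;
import org.xml.sax.helpers.DefaultHandler;

public class EmbeddedExtract {
    public static List<Metadata> parseRecursively(InputStream stream)
            throws Exception {
        RecursiveParserWrapper wrapper = new RecursiveParserWrapper(
                new AutoDetectParser(),
                new BasicContentHandlerFactory(
                        BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1));
        wrapper.parse(stream, new DefaultHandler(), new Metadata(),
                new ParseContext());
        return wrapper.getMetadata(); // metadata + content per embedded doc
    }
}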


Nick


Re: Unable to use -classpath

2018-03-05 Thread Nick Burch

On Sat, 3 Mar 2018, Jean-Nicolas Boulay Desjardins wrote:

I am using this command:

java -classpath /home/$USER/Projects/Lab/tika/classes/ -jar
./tika-app/target/tika-app-1.17.jar


Java ignores -classpath if you also specify -jar


In /home/$USER/Projects/Lab/tika/classes/ I have:
sqlite-jdbc-3.19.3.jar


Java only reads classes from a directory on a classpath by default, not 
jars, so the jar in here will be ignored



So, you need to stop using -jar if you want -classpath to be used, and to 
tell Java to load your other jars by either giving a wildcard on the 
classpath line or explicitly specifying all of them
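
eg, for the jars above, something like this (org.apache.tika.cli.TikaCLI 
being the app's main class):

java -cp "/home/$USER/Projects/Lab/tika/classes/*:./tika-app/target/tika-app-1.17.jar" org.apache.tika.cli.TikaCLI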


Nick


Re: Malware RTF is not detected as RTF

2018-03-01 Thread Nick Burch

On Thu, 1 Mar 2018, Jim Idle wrote:
Malicious RTF files take advantage of the fact that Microsoft do not 
follow their own RTF spec. Specifically, Word et al only looks for the 
opening sequence:


{\rt

Though the spec says it should be:

{\rtf1


I don't think that Tika can assume that all RTF users are as broken as 
Word is!


I'd be tempted to define a new mimetype of application/x-broken-rtf or 
similar, and feed that a lower priority magic for {\rt, with a suitable 
comment/explanation. That way, we won't tell people something is an RTF 
which isn't, but we can help them spot these problematic files
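
A sketch of what that could look like in a custom-mimetypes.xml - note the 
backslash is escaped as \\ in the magic value:

<?xml version="1.0" encoding="UTF-8"?>
<mime-info>
  <mime-type type="application/x-broken-rtf">
    <sub-class-of type="application/rtf"/>
    <!-- Lower priority than the standard RTF magic, so real {\rtf1 files
         still detect as application/rtf -->
    <magic priority="40">
      <match value="{\\rt" type="string" offset="0"/>
    </magic>
  </mime-type>
</mime-info>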


If you could create a small, broken but non-malicious rtf file, then raise 
an enhancement jira + attach, that'd be great!


Nick


Re: Long time with OCR

2018-02-20 Thread Nick Burch

On Mon, 19 Feb 2018, Mark Kerzner wrote:
Is that a good approach? Is the 10 seconds time normal? I am using the 
latest most powerful Mac and I get similar results on an i7 processor in 
Ubuntu.


Tika uses the open source Tesseract OCR engine. Tesseract is optimised for 
ease of contributions and ease of implementing new approaches, rather than 
for performance, because as an (ex?-) academic project that's more what 
they think is important


There's some advice on the Tesseract github issues + wiki on ways to speed 
it up, eg https://github.com/tesseract-ocr/tesseract/issues/263 and

https://github.com/tesseract-ocr/tesseract/issues/1171 and
https://github.com/tesseract-ocr/tesseract/wiki/4.0-Accuracy-and-Performance

Otherwise you'd need to switch to a proprietary OCR tool. I understand 
that the Google Cloud OCR is pretty good, if you don't mind pushing all 
your files up to Google and paying per file


Nick


Re: Detect JSON / PDF specific mime type

2018-02-05 Thread Nick Burch

On Mon, 5 Feb 2018, Matteo Alessandroni wrote:
I'm using Apache Tika to detect a file Mime Type from its base64 
rapresentation. Unfortunately I don't have other info about the file 
(e.g. extension).


and it gives me "text/plain" for JSON and PDF files, but I would like to 
obtain more specific information: "application/json", 
"application/pdf" etc...


You can't detect JSON files from mime magic alone - json doesn't have 
anything unique at the start, just lots of possible different things which 
also occur in other formats too


Tika can detect a PDF from the magic bytes at the start just fine. Make 
sure you're actually decoding the base64 representation properly
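
eg a minimal sketch:

import java.util.Base64;
import org.apache.tika.Tika;

public class Base64Detect {
    public static String detect(String base64) throws Exception {
        // Decode first - detection must see the raw bytes, not the base64
        byte[] bytes = Base64.getDecoder().decode(base64);
        return new Tika().detect(bytes); // "%PDF-" magic => application/pdf
    }
}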


Nick


Re: Binary file check

2018-01-21 Thread Nick Burch

On Fri, 19 Jan 2018, Kudrettin Güleryüz wrote:

One more thing, regarding application/xml vs text/xml
I think I'll skip application/xml for now and just include text/xml

Assuming application/xml is compressed XML such as Open office documents
and text/xml as uncompressed XML


Nope! They're both uncompressed textual XML!

Generally though, when defining a new xml-based filetype, the spec authors 
decide if it's going to be vaguely readable-editable or opaque, then pick 
if they go for text/xml or application/xml as the parent type. Can be a 
bit random which they go for though! See 
https://stackoverflow.com/a/4832418/685641 for a bit more info and some 
references


Nick

Re: Binary file check

2018-01-14 Thread Nick Burch

On Thu, 11 Jan 2018, Kudrettin Güleryüz wrote:
I am not an expert on mime types and how they extend.  My definition of 
binary is any file that is not in human readable form. Any other file, 
I'd like to index. Would that answer your question?


Some of us humans here can read a wider range of formats than others, 
especially if we go slowly... ;)


For now, I'd suggest you start with:
 * Does the mimetype start with text/ ?
 * If not, check all parents (supertypes) to see if any of those start
   with text/

Then:
 * Try a few formats with a parent of application/xml, and see if you want
   to include or exclude those (are they human readable enough?)
 * Try a few formats with a parent of text/xml or text/html, and see if
   you want to include or exclude them (ditto on really human readable)

Use 
https://tika.apache.org/1.17/api/org/apache/tika/mime/MediaTypeRegistry.html#getSupertype-org.apache.tika.mime.MediaType-
to get the parent types

Use 
http://tika.apache.org/1.17/api/org/apache/tika/mime/MediaType.html#getType--
to check if a mimetype is text/ or not (check for getType().equals("text"))
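
Putting those together, a rough sketch (untested):

  import org.apache.tika.mime.MediaType;
  import org.apache.tika.mime.MediaTypeRegistry;

  public static boolean isTextual(String detectedType) {
      MediaTypeRegistry registry = MediaTypeRegistry.getDefaultRegistry();
      MediaType type = MediaType.parse(detectedType);
      while (type != null) {
          // text/plain, text/html, text/x-php etc all count as textual
          if ("text".equals(type.getType())) {
              return true;
          }
          // walk up to the parent type; null once past the root
          type = registry.getSupertype(type);
      }
      return false;
  }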

Nick

Re: Binary file check

2018-01-11 Thread Nick Burch

On Thu, 11 Jan 2018, Kudrettin Güleryüz wrote:

Does Tika library provide an efficient binary file check?


How do you define "binary"?

Only things with a mimetype that starts text/ ? Or do you want to include 
application/xml files? Or things that extend from XML like DIF and 
FictionBook? Only things that contain ascii-printable characters? Other?


We need to know your definition of binary to be able to suggest!

Nick

RE: Very slow parsing of a few PDF files

2017-11-21 Thread Nick Burch

On Tue, 21 Nov 2017, Jim Idle wrote:
Following up on this, I will try cancelling my thread based tasks after 
a pre-set time limit. That is only going to work if Tika and the 
underlying parsers behave correctly with the interrupted exception. 
Anyone had any success with that? I am mainly looking at Office, PDF and 
HTML right now. I will try it myself of course, but perhaps someone has 
already been down this path?


Have you tried with ForkParser? That would also protect you against other 
kinds of failures like OOM too
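
A rough sketch of a ForkParser set-up, in case it helps (untested; 
MyClass and the stream/handler variables are illustrative):

  import org.apache.tika.fork.ForkParser;
  import org.apache.tika.metadata.Metadata;
  import org.apache.tika.parser.AutoDetectParser;
  import org.apache.tika.parser.ParseContext;
  import org.apache.tika.sax.BodyContentHandler;

  ForkParser parser = new ForkParser(
          MyClass.class.getClassLoader(), new AutoDetectParser());
  try {
      BodyContentHandler handler = new BodyContentHandler(-1);
      parser.parse(stream, handler, new Metadata(), new ParseContext());
  } finally {
      // shuts down the forked JVMs
      parser.close();
  }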


Nick


Re: Very slow parsing of a few PDF files

2017-11-06 Thread Nick Burch

On Tue, 7 Nov 2017, Jim Idle wrote:

I have a few PDF files that are taking a very long time to parse.


Are you sure it's a PDF? The profiler images you've sent are all for 
Apache POI and seem to show an XLS file being parsed


Nick


Re: Using TikaConfig troubles

2017-11-03 Thread Nick Burch

On Fri, 3 Nov 2017, Markus Jelsma wrote:

This is how Nutch gets the parser:
Parser parser = tikaConfig.getParser(MediaType.parse(mimeType));

When no custom config is specified config is:
new TikaConfig(this.getClass().getClassLoader());

When I specify a custom config, it is:
tikaConfig = new TikaConfig(conf.getResource(customConfFile));


I think you need to give both the classloader and the config file for your 
setup


Can you try this constructor:
https://tika.apache.org/1.16/api/org/apache/tika/config/TikaConfig.html#TikaConfig-java.net.URL-java.lang.ClassLoader-

With something like
  new TikaConfig(conf.getResource(customConfFile),
 this.getClass().getClassLoader());

Nick


Re: Java 9 and JAXB dependency in tika-core

2017-09-14 Thread Nick Burch

On Thu, 14 Sep 2017, Robert Munteanu wrote:

One of the issues that came up is that tika-core has a dependency on
JAXB [1]. The javax.xml.bind packages are no longer part of the java.se
module, and therefore not available by default on the module path. The
issue can be triggered with a simple invocation of tika-app on Java 9


Is there a recommended way to replace JAXB that still works on older 
versions of Java?



Is there interest in making Tika work on Java 9 without the need to use
the '--add-modules' switch? That would entail just removing the
java.xml.bind dependencies; for tika-core and tika-parsers all the
dependencies are contained in java.se.

Given that there is interest, what would be the preferred solution and
the plans for a next release? I might be able to provide a patch if
it's not too invasive.


If you can think of a way to re-do the XML parsing in the Tika Config 
classes, such that they still work on Java 7+, but also work OOTB on Java 
9, we'd love a patch!


If not, if you could find some guidance online for how to migrate 
JAXB-using code to work with both Java 9 and 7+, we can take a look at 
some point


Thanks
Nick


Re: Detecting .bat and .cmd files

2017-08-23 Thread Nick Burch

On Wed, 23 Aug 2017, epast...@vt.edu wrote:

I'm trying to get tika to detect .bat and .cmd files. Both are returning as 
text/plain.

In the xml file, 
(https://github.com/apache/tika/blob/master/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml) 
bat falls under application/x-msdownload, yet it returns as 
text/plain.


Good spot! I've raised TIKA-2445 for this. Should now be fixed - both 
Windows .bat and .cmd should now be detected as application/x-bat, which 
seems to be the closest to a consensus mimetype for them


Nick


Re: Performance Improvement AutoDetectParser

2017-08-04 Thread Nick Burch

On Fri, 4 Aug 2017, aravinth thangasami wrote:

we are using Tika 1.13.


1.15 is out!


While instantiating AutoDetectParser we found that the
CompositeExternalParser, which we don't actually need, takes up more time.
It's because of ExifTool & FFmpeg.

I tried removing CompositeExternalParser from the jar and we are seeing an
improvement.


You should be able to exclude that from DefaultParser in config with a 
parser-exclude:

http://tika.apache.org/1.16/configuring.html#Configuring_Parsers

Then make sure you create your AutoDetectParser from the config with that 
exclude
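
Something like this in your tika config file ought to do it (a sketch for 
the 1.x XML config format, per the page above):

  <properties>
    <parsers>
      <parser class="org.apache.tika.parser.DefaultParser">
        <parser-exclude
            class="org.apache.tika.parser.external.CompositeExternalParser"/>
      </parser>
    </parsers>
  </properties>

And then something like:

  AutoDetectParser parser =
      new AutoDetectParser(new TikaConfig("tika-config.xml"));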


Nick


Re: Parse file without creating tmp file

2017-07-11 Thread Nick Burch

On Tue, 11 Jul 2017, aravinth thangasami wrote:

Recently I have noticed Tika creates a tmp file before parsing the
stream.


Only for certain formats, generally where the underlying parsing library 
requires a file for random-access



I don't have much experience in Tika but I feel it is an overhead.
Can we achieve file parsing without writing to tmp file?


For some files, no, not without re-writing other open source libraries

For most, it isn't needed and Tika won't do it

Nick


Re: Adding a WARC parser to Tika

2017-07-10 Thread Nick Burch

On Mon, 10 Jul 2017, Allison, Timothy B. wrote:

Sorry, I can't tell if this is tongue-in-cheek...


No, I do think we should add a WARC parser to Tika Parsers.

Once done, I'd suggest we figure out a way for Tika Batch to run over a 
collection of WARC files just as it does for directories, to make it 
easier to run over crawl collections without having to unpack them first!


Nick


Re: Tika content detection and crawled "remote" content

2017-07-05 Thread Nick Burch
Having taken a "quick" look over lunch at some of the "programming 
language" ones, and gone down a rabbit hole... I think at least some of 
them are as described in TIKA-2419, where our change to the HTML magic 
priority, to fix HTML-containing formats like email, had broken some 
things.


I've done a quick fix for 1.16, but it'd be good to try the impact of 
other things, eg dropping the xml priority to match the html one to see 
if that helps / breaks other things



Otherwise, for anything else (eg that word / graphviz one), please do 
open up JIRAs!


Thanks
Nick

On 05/07/17 14:10, Allison, Timothy B. wrote:

Why, yes, please!  JIRA with small samples would be fantastic.  I think working 
in desc order of most common to least would be best...php, asp, coldfusion.

I'm about to cut 1.16, but I look forward to improving Tika with this 
tremendously useful data.

Again, many thanks!

Cheers,

Tim

-Original Message-
From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
Sent: Wednesday, July 5, 2017 9:03 AM
To: user@tika.apache.org
Subject: Re: Tika content detection and crawled "remote" content

Hi Tim,

thanks! Let me know if I should take any actions (e.g., open issue(s) on Jira) 
or whether I can help by compiling smaller test sets.

Best,
Sebastian

On 07/05/2017 02:09 PM, Allison, Timothy B. wrote:

This is FANTASTIC!!!  Thank you, Sebastian!

I suspect that we should try to fix these at the Tika level.  We'll never be 
100%, but most of the problems you describe _should_ be fixable.

  > If anyone is interested in using the detected MIME types or anything else from Common 
Crawl - I'm happy to help!  The URL index [4] now contains a new field 
"mime-detected" which makes it easy to search or grep for confusion pairs.

This is an amazing step forward for our regression corpus.  We used to rely on 
the http headers and/or file suffix to oversample non-html.  This will allow 
far cleaner pulls.

-Original Message-
From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
Sent: Tuesday, July 4, 2017 6:18 AM
To: user@tika.apache.org
Subject: Tika content detection and crawled "remote" content

Hi,

recently I've plugged Tika's content detection into Common Crawl's crawler 
(modified Nutch), with the goal of getting clean and correct MIME types - 
the HTTP Content-Type may contain garbage and isn't always correct [1].

For the June 2017 crawl I've prepared a comparison of content types
sent by the server in the HTTP header and as detected by Tika 1.15
[2].  It shows that content types by Tika are definitely clean
(1,400 different content types vs. more than 6,000 content type "strings" from 
HTTP headers).

A look on the "confusions" where Content-Type and Tika differ, shows a mixed 
picture: some pairs are plausible, e.g., if Tika changes the type to a more precise 
subtype or detects the MIME at all:

             Tika-1.15              HTTP-Content-Type
 1001968023  application/xhtml+xml  text/html
    2298146  application/rss+xml    text/xml
     617435  application/rss+xml    application/xml
     613525  text/html              unk
     361525  application/xhtml+xml  unk
     297707  application/rdf+xml    application/xml


However, there are a few dubious decisions, esp. the group of web server-side 
scripting languages (ASP, JSP, PHP, ColdFusion, etc.):

            Tika-1.15          HTTP-Content-Type
  2047739   text/x-php         text/html
   681629   text/asp           text/html
   193095   text/x-coldfusion  text/html
   172318   text/aspdotnet     text/html
   139033   text/x-jsp         text/html
    38415   text/x-cgi         text/html
    32092   text/x-php         text/xml
    18021   text/x-perl        text/html

Of course, due to misconfigurations some servers may deliver the script files 
unmodified but in general I wouldn't expect that this happens for millions of 
pages.  I've checked some of the affected URLs:

- HTML fragment (no <!DOCTYPE html> declaration or <html> opening tag)

https://www.projectmanagement.com/profile/profile_contributions.cfm?profileID=46773580=_b=0_mb=0_q=0_a=2_r=1_bc=1_wc=0_we=0_ar=0_ack=0_v=0_d=0_ra=2_p=0
 http://www.privi.com/product-details.asp?cno=C10910011
 http://mental-ray.de/Root_alt/Default.asp
 http://ekyrs.org/support/index.php?action=profile
 http://cwmorse.eu5.org/lineal/mostrar.php?contador=200

- (overlong) comment block at start of HTML which "masks" the HTML declaration
 http://www.mannheim-virtuell.de/index.php?branchenID=2=24

http://www.exoduschurch.org/bbs/view.php?id=sunday_school=1==1=off=on=on_arrange=headnum=asc=6
 
https://www.preventiongenetics.com/About/Resources/disease/MarfansSyndrome.php
 https://de.e-stories.org/categories.php?=nl=p

- HTML with some scripting fragments (e.g. "<?php ... ?>") present:
 http://www.eco-ani-yao.org/shien/

- others are clearly HTML (looks more like a bug, at least, there is no simple 
explanation)
 http://www.proedinc.com/customer/content.aspx?redid=9
 

Re: Limit on input PDF file size in Tika?

2017-06-08 Thread Nick Burch

On Thu, 8 Jun 2017, tesm...@gmail.com wrote:

Thanks for your reply. I am calling Apache Tika in Java code like this:

public String extractPDFText(String faInputFileName)
        throws IOException, TikaException {
    // Handler for body text of the PDF article
    BodyContentHandler handler = new BodyContentHandler();


Change this to "new BodyContentHandler(-1)" to remove the write limit. 
More details in the javadocs:

https://tika.apache.org/1.15/api/org/apache/tika/sax/BodyContentHandler.html#BodyContentHandler-int-

Nick


Re: Limit on input PDF file size in Tika?

2017-06-08 Thread Nick Burch

On Thu, 8 Jun 2017, tesm...@gmail.com wrote:

My Tika code is not extracting the full body text of larger PDF files.

Files more than 1 MB in size and around 20 pages are only partially 
extracted. Is there any limit on input PDF file size in Tika?


How are you calling Apache Tika? Direct java calls to TikaConfig + 
AutoDetectParser? Using the Tika facade class? Using the Tika App on the 
command line? Tika Server? Other?


Nick


Re: Extracting macros in 1.15

2017-06-03 Thread Nick Burch

On Sat, 3 Jun 2017, Jim Idle wrote:
After being baffled why macros no longer show up in 1.15 I found: 
https://issues.apache.org/jira/browse/TIKA-2302


Can anyone point me to an example of doing this? I am finding bits and 
pieces but no example of turning macros back on. I basically want all 
macros in all documents, office, pdf, anything really.


How do you call Apache Tika? Tika App? Tika Server? Tika java class 
facade? Direct Java calls to TikaConfig / AutoDetectParser etc?


The solution will differ depending on which one you use

Nick


Re: TIKA for confidential documents

2017-05-13 Thread Nick Burch

On Sat, 13 May 2017, Julian Decker wrote:
is there any connection and data transfer to external servers by using 
the Tika Server or Tika App?


None out-of-the-box.

If you turn on Translation, or most of the NER / NLP / Object Recognition 
stuff, Tika will send the relevant things to the appropriate external 
service you've configured.


Out of the box, the Tika App will run everything within the JVM. Out of 
the box, the Tika Server will run everything within its JVM, but obviously 
you'll have to send stuff over the network to get it to the server, so 
consider enabling SSL


Nick


RE: Extract Message-ID in EML file

2017-04-21 Thread Nick Burch

On Fri, 21 Apr 2017, Allison, Timothy B. wrote:

Probably?  Please open an issue on our JIRA and submit an example file.


I think you can often get it from
  Message:Raw-Header:Message-ID

But that isn't ideal. We probably ought to define a proper Message: 
property for it, and have all the email parsers expose that


Needs a Jira and proper tracking though!

Nick


Re: Fwd: Tika not parsing underlines

2017-01-04 Thread Nick Burch

On Thu, 5 Jan 2017, Kamesh Joshi wrote:

I am trying to parse the attached PDF, but it does not give me the 
places where underlines are present; it just returns plain text. 
Please help me: how can I also get the underlines present in the PDF, 
or some way to split the text based on them?

I am using curl -T Downloads/kameshjoshi.pdf http://localhost:9998/tika
--header "Accept: text/plain" in my command line.


You need to ask Tika to give you the HTML version to be able to spot 
markup like underlines. Swap that accept header to text/html and you 
should then be able to see them
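
For example, the same call with just the Accept header changed:

  curl -T Downloads/kameshjoshi.pdf http://localhost:9998/tika \
    --header "Accept: text/html"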


Nick


Re: Mime type matching: tika-mimetypes.xml

2016-11-09 Thread Nick Burch

On Wed, 9 Nov 2016, Chris Bamford wrote:


<mime-type type="message/rfc822">
  <magic priority="50">
    <match value="Message-ID:" type="string" offset="0:8192"/>
    …
  </magic>
  <glob pattern="*.eml"/>
  ...
</mime-type>



Does offset="0:8192" mean match 'Message-ID:' anywhere in the first 8192 
bytes?


Yup, that's it. If that is found, and nothing with a priority higher 
than 50 also matches, it'll return that type. If a higher priority magic 
matches, that other type will win.


(There's also some bits for if the extension matches a type in the same 
family, eg for specialising)


If so, I'm not sure it is working properly as I have some eml files with 
this string near the beginning (but not at byte offset 0) where it does 
not match.  Is there some other logic involved which I am missing?


If you can share a small file that shows it, we can take a look for you.

Nick

Re: Get file metadata without retrieving entire file with Tika Server

2016-10-13 Thread Nick Burch

On Thu, 13 Oct 2016, Mr Havecamp wrote:
However, the problem with either option is that we need to retrieve the 
entire file from storage; this is fine for smaller text files but when 
handling these larger files, it seems wasteful and time-consuming to 
download, say, a video file just to extract the metadata information (we 
wouldn't be indexing the video content).


For a great many file formats, including most video ones, you need the 
whole file to be able to fully extract all the metadata


Nick


Re: Tika: parsing mixed content e-mails

2016-10-06 Thread Nick Burch

On Thu, 6 Oct 2016, Ingo Siebert wrote:

Am 05.10.2016 um 20:04 schrieb Nick Burch:

On Wed, 5 Oct 2016, Ingo Siebert wrote:
I just used Tika (org.apache.tika:tika-parsers:1.13) to parse an e-mail 
with multipart/mixed content.


How do you want to get the various parts back? All text inlined, or a 
special callback for each part? What about the metadata for the parts?


A MS Office document consists also of several parts and chapters and I get 
them as one string.


A MS Office document can have other documents, images, sounds etc embedded 
in it too! You have to ask Tika for those in the same way


At least for my use-case it would be sufficient to get the data concatenated 
into one string, but it would also be nice to get the parts separately.


If you pop an EmbeddedDocumentExtractor onto the ParseContext, that'll be 
called to let you handle each part in turn. You might want a 
ParsingEmbeddedDocumentExtractor to give you parsed contents rather than 
raw parts
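
A rough sketch (untested) of wiring that up for an email:

  import java.io.InputStream;
  import org.apache.tika.extractor.EmbeddedDocumentExtractor;
  import org.apache.tika.metadata.Metadata;
  import org.apache.tika.parser.AutoDetectParser;
  import org.apache.tika.parser.ParseContext;
  import org.apache.tika.parser.Parser;
  import org.apache.tika.sax.BodyContentHandler;
  import org.xml.sax.ContentHandler;

  AutoDetectParser parser = new AutoDetectParser();
  ParseContext context = new ParseContext();
  context.set(Parser.class, parser);
  context.set(EmbeddedDocumentExtractor.class, new EmbeddedDocumentExtractor() {
      public boolean shouldParseEmbedded(Metadata metadata) {
          return true;
      }
      public void parseEmbedded(InputStream stream, ContentHandler handler,
                                Metadata metadata, boolean outputHtml) {
          // called once per part / attachment - handle each one here
          System.out.println("Part: " + metadata.get(Metadata.CONTENT_TYPE));
      }
  });
  // emailStream is your multipart/mixed message
  parser.parse(emailStream, new BodyContentHandler(-1), new Metadata(), context);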


Nick


Re: Code parser?

2016-09-29 Thread Nick Burch

On Wed, 28 Sep 2016, Mark Kerzner wrote:

probably yes, but how do I tell it which parser to use? Today, I just do
that

String text = tika.parseToString(inputStream, metadata);

and it knows the parser.


That might be your issue. It's quite hard to identify the language of a 
piece of source code from just the first few hundred bytes of text. If you 
tell Tika the filename, including the extension, it'll have much more luck 
spotting the file is code and using the appropriate parser!


(Binary files often have common magic at/near the start that helps Tika 
identify the file type, source code is text based and lacks that)
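
A small sketch of passing the filename hint via the facade (the metadata 
key is real in 1.x; the filename is illustrative):

  import org.apache.tika.metadata.Metadata;

  Metadata metadata = new Metadata();
  metadata.set(Metadata.RESOURCE_NAME_KEY, "Example.java");
  String text = tika.parseToString(inputStream, metadata);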


Nick


Re: How to parse PDF files effectively with Tika

2016-09-12 Thread Nick Burch

On Mon, 12 Sep 2016, Sergey Beryozkin wrote:
By the way, I've found out AutoDetectParser may not work if the (pdf) stream 
is an attachment stream which may not support a mark.


Simplest would probably be just to wrap it in a TikaInputStream, which 
would handle any buffering/marking as needed
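
Something like (TikaInputStream.get is the real API; the variables are 
illustrative):

  import org.apache.tika.io.TikaInputStream;

  try (TikaInputStream tis = TikaInputStream.get(attachmentStream)) {
      parser.parse(tis, handler, metadata, context);
  }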


I've been wondering, would it make sense to pass a MediaType identifying the 
data format as either a ParseContext or Metadata property for 
AutoDetectParser, to avoid trying to read the stream?


If you pass in the filename and disable all detectors other than the 
filename/glob one, or pass in the mime type and disable all detectors, the 
auto detect parser ought to skip anything to do with the stream at 
detection time


Nick


Re: Problem with detection of RFC822 message

2016-07-28 Thread Nick Burch

On Thu, 28 Jul 2016, Vjeran Marcinko wrote:
Just as I resolved the problem with the MBOX parser, I noticed that it 
doesn't correctly detect contained RFC822 messages as message/rfc822, 
but usually as text/html or some variation of it.


And question as before, is there some workaround for 1.13 to place in
custom-mimetypes.xml that would fix this?


Can you create a small junit testcase that shows the problem, using either 
a small mbox file of your own, or one of the ones in the tika-parsers test 
documents directory? Attach that to a new JIRA issue, and one of us can 
use it to take a look at what's going wrong. Once we know the underlying 
issue, we can hopefully fix it, and maybe let you know a workaround!


Nick


Re: No Unicode mapping warnings

2016-07-26 Thread Nick Burch

On Tue, 26 Jul 2016, Oliver Steinau wrote:
I'm having problems extracting text from a small (43 KB) PDF file using 
tika-1.13 -- I get a bunch of warnings like


WARN  No Unicode mapping for C0104 (38) in font FDLICI+PSOwstswiss
WARN  No Unicode mapping for C0097 (31) in font FDLICI+PSOwstswiss


Can you try with the ExtractText tool from Apache PDFBox? 
http://pdfbox.apache.org/2.0/commandline.html#extracttext
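
Something like (the jar name depends on the PDFBox version you download):

  java -jar pdfbox-app-2.0.2.jar ExtractText problem.pdf out.txt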


If that works fine, then it's a Tika bug and we'll need to look into it. 
If that fails with the same problem, then you'd need to report a bug to 
PDFBox and attach a problematic pdf file to the jira. (Tika would then get 
the fix on the next release)


Nick


Re: Problem with detection of .mbox file

2016-07-25 Thread Nick Burch

On Mon, 25 Jul 2016, Vjeran Marcinko wrote:

I first noticed that my .mbox file doesn't get parsed by MBoxParser,
and later, after debugging the Tika source code, I found what the problem
is - the default detector doesn't even recognize it as the "application/mbox"
MIME type, and although the file extension is .mbox, it ignores this hint
because its "magic" detection, based on some amount of initial bytes,
decides it is "text/html"


Can you try with a recent Tika nightly build? There have been some 
tweaks done in that area recently


If a nightly build / build from Git still shows the issue, please open a 
bug in Jira and attach a problematic file, then we can take a look!


Nick


Re: DATE metadata from email

2016-05-15 Thread Nick Burch

On Sun, 15 May 2016, Philipp Steinkrüger wrote:
To begin with, I noticed the following behaviour which might or might 
not be a bug. I asked this question on stackexchange 
(https://stackoverflow.com/questions/37226842/tika-metadata-from-email-misses-date 
) 
but perhaps this is a better place.


I have two email testfiles:


It sounds like it might be a bug. Could you open a new bug entry in JIRA, 
and upload the two test files there? We can then use that to confirm


Nick

My "What's new with Apache Tika 2.0" talk slides

2016-05-11 Thread Nick Burch

Hi All

For those who couldn't make it to Vancouver this week, the slides from my 
"What's new with Apache Tika 2.0" talk are now available online:

http://www.slideshare.net/NickBurch2/apache-tika-whats-new-with-20

The audio was recorded, hopefully that will be available to go with the 
slides in a few days time


Nick


Re: XML Parser with type recognition

2016-05-11 Thread Nick Burch

On Wed, 11 May 2016, plug...@free.fr wrote:
If you can take a look at my little gist example 
https://gist.github.com/anonymous/3506db4367040ea8f381c5b7b435b3f9 it 
will be very helpful.


The localName parameter is case sensitive - the root element in your 
sample file doesn't match the case of the localName you've used.


Nick


Re: XML Parser with type recognition

2016-05-11 Thread Nick Burch

On Wed, 11 May 2016, plug...@free.fr wrote:
Ok if I understand, I can create a specific mime type in the 
tika-mimetypes.xml resource file like this:

<mime-type type="application/vast+xml">
  <sub-class-of type="application/xml"/>
  <root-XML localName="VAST"
            namespaceURI="http://www.w3.org/2001/XMLSchema-instance"/>
  <glob pattern="*.xml"/>
</mime-type>



Almost - you can't set that glob as it's already claimed. Otherwise, 
assuming that is the right namespace and name for your files, that's it. 
The xml root match will cause that mimetype to win over the default xml 
one


Nick


Re: XML Parser with type recognition

2016-05-10 Thread Nick Burch

On Tue, 10 May 2016, plug...@free.fr wrote:
But now I'm facing the problem of detecting only some specific XML files. 
I can't just detect "application/xml"; I need to detect which type of 
XML it is (in my case 
http://www.iab.com/guidelines/digital-video-ad-serving-template-vast-3-0/). 
But the files uploaded always end with the .xml extension and carry 
the "application/xml" content-type metadata.


So how can I detect this kind of specific XML with the same mime type and 
same file extension as the generic XML parser?


You should add a custom mimetype:
http://tika.apache.org/1.12/parser_guide.html#Add_your_MIME-Type
Have that extend application/xml and define the root xml element your 
format requires. If you look at the main tika mime types file, there's a 
few good examples almost at the top


Nick


Re: disable extraction of images

2016-04-13 Thread Nick Burch

On Wed, 13 Apr 2016, ron.vandenbranden wrote:
Is it possible to disable text extraction from images inside a PDF file? 
I'm testing with the CLI tika app, which has "extractInlineImages" set 
to false by default, if I'm not mistaken. Yet, the text of the images 
is still present in the generated HTML output. Am I missing something 
obvious?


Yup, see "Disable Tika OCR" in https://wiki.apache.org/tika/TikaOCR (or 
remove tesseract from your path!)


Nick


Re: Fwd: How to enable multiple parsers for content type ?

2016-03-23 Thread Nick Burch

On Wed, 23 Mar 2016, Thamme Gowda N. wrote:

Question : How to enable multiple parsers for specific mimetypes?

I am using tika to parse html pages.

My requirement is that both *NamedEntityParser* and *HtmlParser* have to be
enabled for specific web-related MIME types like *text/html* and
*application/xhtml+xml*.


This is not currently supported.

See http://wiki.apache.org/tika/CompositeParserDiscussion for the 
discussion on it. If you have ideas on how we can solve the issue of 
multiple parsers needing to output to the same write-once SAX stream, 
including for the fallback case, please shout!


(You can chain multiple content handlers together, so one option might be 
to try to get the named entity stuff to enrich the HTML SAX event stream 
rather than needing to be a standalone parser)


Nick


Re: Using tika-app-1.11.jar

2016-02-11 Thread Nick Burch

On Wed, 10 Feb 2016, Steven White wrote:

I'm including tika-app-1.11.jar with my application and see that Tika
includes "slf4j".


The Tika App single jar is intended for standalone use. It's not generally 
recommended to be included as part of a wider application, as it tends to 
include everything and the kitchen sink, to allow for easy standalone 
use


Generally, you should just tell Maven / Gradle / Ivy that you want to 
depend on Tika Core + Tika Parsers, then your build tool will fetch + 
bundle all the dependencies for you. That lets you have proper control 
over conflicting versions of jars etc
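
For example, with Maven (versions as per the 1.11 you mentioned):

  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>1.11</version>
  </dependency>
  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.11</version>
  </dependency>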


Nick

