Tika unable to extract PDF Text

2015-10-14 Thread Adam Retter
I have a PDF which was created using Apache PDFBox 2.0.0-SNAPSHOT. Unfortunately, Tika 1.10 seems unable to extract any text from the PDF, and I don't get any exceptions or errors. The code is as simple as: new Tika().parseToString(new FileInputStream(f)) Tika is always returning just the empty

Re: AutoDetectParser bug?

2015-10-14 Thread Ziqi Zhang
Thanks, I have created an issue. metadata.set(RESOURCE_NAME_KEY, filename) also did not work. For now I am telling the parser explicitly that the files are plain text. But it would be really nice to have this addressed, because I would like to use the auto-detect ability in my app. Regards > On

RE: Tika unable to extract PDF Text

2015-10-14 Thread Allison, Timothy B.
File works with Tika trunk. What's on your classpath: tika-app or just tika-core? Is there a chance that you don't have tika-parsers on your cp? -Original Message- From: Adam Retter [mailto:adam.ret...@googlemail.com] Sent: Wednesday, October 14, 2015 12:14 PM To:
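
A quick way to answer the classpath question is to check whether a PDF parser class can be loaded at all. The following is a minimal sketch, not part of the original thread: the helper name isOnClasspath is hypothetical, and the class name org.apache.tika.parser.pdf.PDFParser is an assumption about where the tika-parsers jar of Tika 1.x keeps its PDF parser.

```java
class ClasspathCheck {
    // Hypothetical helper: returns true if the named class can be located
    // by this class's class loader, without initializing it.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumption: this is the PDF parser class shipped in tika-parsers (Tika 1.x).
        String pdfParser = "org.apache.tika.parser.pdf.PDFParser";
        System.out.println(pdfParser + " present: " + isOnClasspath(pdfParser));
    }
}
```

If this reports false when run with the application's classpath, then only tika-core is likely available, which would be consistent with text extraction returning an empty string without any error.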

Re: Fwd: AutoDetectParser bug?

2015-10-14 Thread Nick Burch
On Wed, 14 Oct 2015, Ziqi Zhang wrote: My apologies, here are the testing files attached. Any chance you could open a bug in Bugzilla, and attach these files there? At first glance, it looks like those files have certain text patterns near the start which are causing them to be

AutoDetectParser bug?

2015-10-14 Thread Ziqi Zhang
Hi, there might be a bug with the AutoDetectParser, which fails to recognise some plain-text files as plain text. In the attachment are three testing files; as you can see, they are all plain text. The following code is used for my testing: AutoDetectParser parser = new

Re: AutoDetectParser bug?

2015-10-14 Thread Konstantin Gribov
This is a result of false-positive mime-type detection. In the first case the file starts with "ID3", which is usually present in mp3 (audio/mpeg) files. The other two files start with P1 or P4, which are present at the start of image/x-portable-bitmap files. You can either use the text parser directly or pass
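
The false positive described above can be illustrated with a small sketch. This is not Tika's detector: the class MagicSniffDemo, the method sniff, and the three-entry magic table are hypothetical and heavily simplified, and they only show why a plain-text file whose first bytes happen to match a known signature is classified as that type.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class MagicSniffDemo {
    // Hypothetical, simplified magic table: leading bytes -> guessed mime type.
    private static final Map<String, String> MAGIC = new LinkedHashMap<>();
    static {
        MAGIC.put("ID3", "audio/mpeg");
        MAGIC.put("P1", "image/x-portable-bitmap");
        MAGIC.put("P4", "image/x-portable-bitmap");
    }

    // Guess a type from the first few bytes only; fall back to text/plain.
    static String sniff(byte[] content) {
        int len = Math.min(content.length, 8);
        String head = new String(content, 0, len, StandardCharsets.US_ASCII);
        for (Map.Entry<String, String> entry : MAGIC.entrySet()) {
            if (head.startsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return "text/plain";
    }

    public static void main(String[] args) {
        // A plain-text note that merely begins with the letters "ID3".
        byte[] note = "ID3 is a tagging format used by mp3 files."
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(note)); // prints "audio/mpeg"
    }
}
```

Because the decision is made from the leading bytes alone, the rest of the content never gets a chance to show that the file is ordinary text, which is why bypassing detection and using the text parser directly works around the problem.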

Re: AutoDetectParser bug?

2015-10-14 Thread Ziqi Zhang
Many thanks. As for Bugzilla, I was unable to create a new bug, as it is saying “first you must pick a product…” and there is no Tika in the list. > On 14 Oct 2015, at 10:40, Konstantin Gribov wrote: > > This is a result of false positive mime-type detection. In first case

Re: AutoDetectParser bug?

2015-10-14 Thread Nick Burch
On Wed, 14 Oct 2015, Ziqi Zhang wrote: As for bugzilla, I was unable to create a new bug, as it is saying “first you must pick a product…” and there is no tika in the list. Sorry, wrong project - POI uses Bugzilla, Tika uses JIRA, I wasn't paying enough attention! The starting point for