RE: Very slow parsing of a few PDF files

2017-11-27 Thread Allison, Timothy B.
The ForkParser does have the ability to kill and restart on permanent hangs.  
We don't have the RecursiveParserWrapper integrated into the ForkParser 
currently...patches are welcomed.

At the Tika level, we generally don't check Thread.interrupted() because 
our dependencies don't do it.  

Unfortunately, you do have to kill the process for a parser that hits a permanent 
hang.  Nothing you can do to a thread will actually be useful; see TIKA-456 for 
a discussion of this.

Some options:

1) The ForkParser will timeout and restart.

2) tika-batch, e.g. java -jar tika-app.jar -i <input_dir> -o <output_dir>, will 
run multithreaded, and it spawns a child process that will be killed and 
restarted on a permanent hang/OOM

3) tika-server...we could/should harden that via a child process that could be 
killed/restarted, but that doesn't currently exist.

4) framework, e.g. Hadoop, etc. see 
http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
 and Ken Krugler's email (somewhere on our list?!) about spawning a separate 
thread for each parse and then aborting the process if there's a timeout

Finally, no matter what option you use, you can use the MockParser in 
tika-core/tests to test that your processing pipeline correctly handles 
timeouts/OOMs etc.  Add that to your class path and then ask Tika to parse one 
of the mock XML files (e.g. one that triggers a hang or an OOM).  See: 
https://wiki.apache.org/tika/MockParser 
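
To make the "detect the timeout, then kill the process" point concrete, here is a
rough, untested sketch (file name and timeout are placeholders).  Note that
Future.cancel(true) only sets the interrupt flag, which most parsers never check,
so on a permanent hang the surrounding JVM still has to be restarted:

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.*;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class ParseWithTimeout {
    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<String> future = executor.submit(() -> {
            AutoDetectParser parser = new AutoDetectParser();
            BodyContentHandler handler = new BodyContentHandler(-1);
            try (InputStream is = Files.newInputStream(Paths.get("suspect.pdf"))) {
                parser.parse(is, handler, new Metadata(), new ParseContext());
            }
            return handler.toString();
        });
        try {
            System.out.println(future.get(60, TimeUnit.SECONDS));
        } catch (TimeoutException e) {
            // cancel(true) merely interrupts; a truly hung parser thread keeps running,
            // so a real pipeline would now exit/restart this JVM (which is what
            // tika-batch and the ForkParser handle for you).
            future.cancel(true);
            System.err.println("Parse timed out; restart this process.");
        } finally {
            executor.shutdownNow();
        }
    }
}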




-Original Message-
From: Jim Idle [mailto:ji...@proofpoint.com] 
Sent: Tuesday, November 21, 2017 11:13 PM
To: user@tika.apache.org
Subject: RE: Very slow parsing of a few PDF files

I didn't know that there was a ForkParser, but that might add significant 
overhead to the application - it looks like it has a pool, though I don't know 
if it gives the ability to, say, kill a long-running parser and restart the 
pool. I will look into it. One thing I see already is that it intercepts 
InterruptedException, wraps it in a TikaException, but does not set the 
thread's interrupted flag and cannot rethrow InterruptedException because the 
Parser interface does not declare it. It catches the inability to communicate, 
but does it start a new process if I cancel one?

I may have no choice though, as RecursiveParserWrapper, like any implementation 
of Parser, does not check for Thread.interrupted() or throw InterruptedException, 
which means that I cannot time out a Future and cancel it.

Anyway, thanks for the pointer - I will play with it.

Jim

> -Original Message-
> From: Nick Burch [mailto:apa...@gagravarr.org]
> Sent: Tuesday, November 21, 2017 17:10
> To: user@tika.apache.org
> Subject: RE: Very slow parsing of a few PDF files
> 
> On Tue, 21 Nov 2017, Jim Idle wrote:
> > Following up on this, I will try cancelling my thread based tasks 
> > after a pre-set time limit. That is only going to work if Tika and 
> > the underlying parsers behave correctly with the interrupted exception.
> > Anyone had any success with that? I am mainly looking at Office, PDF 
> > and HTML right now. I will try it myself of course, but perhaps 
> > someone has already been down this path?
> 
> Have you tried with ForkParser? That would also protect you against 
> other kinds of failures like OOM too
> 
> Nick



RE: Very slow parsing of a few PDF files

2017-11-28 Thread Allison, Timothy B.


>As the HTML parser in Tika does not produce SAX events in the correct order - 
>the parser is great but does not support serialization - etc.

Oh, please open a ticket with examples, or point me to one I've forgotten 
about... ☹  Thank you!


RE: Very slow parsing of a few PDF files

2017-11-29 Thread Allison, Timothy B.
>I am going to have to write my own application specific solution

Ugh.  I'm sorry.  If there's anything shareable, please do share.

> ForkParser tries to serialize every class it thinks will be needed across the 
> connection, and a lot of third-party classes are not serializable. I think 
> that ForkParser is a good enough idea, but I am not sure how practical it is 
> in a real-life application. 

You make a very good point.  We've had issues serializing our own parsers...let 
alone user-specific addons.  I wonder if we could modify ForkClient to kick off 
the forkserver process from a user-specified "bin" directory (instead of the 
current bootstrapped jar), and that bin directory could include at least the 
tika-core.jar, tika-fat-parsers.jar and tika-serialization.jar but could also 
include optional dependencies and user-specific dependencies.  

Hmmm


RE: Very slow parsing of a few PDF files

2017-11-30 Thread Allison, Timothy B.
Great.  I opened TIKA-2514 to track this.  Pull requests are welcomed! 😊

-Original Message-
From: Jim Idle [mailto:ji...@proofpoint.com] 
Sent: Wednesday, November 29, 2017 8:58 PM
To: user@tika.apache.org
Subject: RE: Very slow parsing of a few PDF files

That would be a more practical alternative. I have time scheduled next week for 
an in-house solution but I will first look properly at ForkParser and see if I 
could make something akin to that in generic and configurable fashion. If so, I 
will submit the code.

Jim 

> -Original Message-
> From: Allison, Timothy B. [mailto:talli...@mitre.org]
> Sent: Wednesday, November 29, 2017 23:52
> To: user@tika.apache.org
> Subject: RE: Very slow parsing of a few PDF files
> 
> >I am going to have to write my own application specific solution
> 
> Ugh.  I'm sorry.  If there's anything shareable, please do share.
> 
> > ForkParser tries to serialize every class it thinks will be needed 
> > across the
> connection, and a lot of third-party classes are not serializable. I 
> think that ForkParser is a good enough idea, but I am not sure how 
> practical it is in a real-life application.
> 
> You make a very good point.  We've had issues serializing our own 
> parsers...let alone user-specific addons.  I wonder if we could modify 
> ForkClient to kick off the forkserver process from a user-specified "bin"
> directory (instead of the current bootstrapped jar), and that bin 
> directory could include at least the tika-core.jar, 
> tika-fat-parsers.jar and tika- serialization.jar but could also 
> include optional dependencies and user- specific dependencies.
> 
> Hmmm


RE: How can I get the page number of a word document?

2017-12-07 Thread Allison, Timothy B.
MSWord calculates pages dynamically.  Unlike PDFs, MSWord documents are not 
“page based”.  The only way you can do it with Java is through a COM bridge to 
the MSWord application or maybe via OpenOffice UNO, etc.  If you have VBA, you 
could also programmatically get it out via the MSWord application.

From: 张钧荣 [mailto:1024238...@qq.com]
Sent: Thursday, December 7, 2017 5:04 AM
To: user@tika.apache.org
Subject: How can I get the page number of a word document?


Hi,how can I get the page number of a word document?

Thanks
ZJR



发送自 Windows 10 版邮件应用



RE: How can I get the page number of a word document?

2017-12-08 Thread Allison, Timothy B.
And one other thing, because MSWord calculates pages dynamically, it often does 
not store the correct page count within the file, so that information is often 
misleading.  Beware.

From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Thursday, December 7, 2017 8:43 AM
To: user@tika.apache.org
Subject: RE: How can I get the page number of a word document?

MSWord calculates pages dynamically.  Unlike PDFs, MSWord documents are not 
“page based”.  The only way you can do it with Java is through a COM bridge to 
the MSWord application or maybe via OpenOffice UNO, etc.  If you have VBA, you 
could also programmatically get it out via the MSWord application.

From: 张钧荣 [mailto:1024238...@qq.com]
Sent: Thursday, December 7, 2017 5:04 AM
To: user@tika.apache.org<mailto:user@tika.apache.org>
Subject: How can I get the page number of a word document?


Hi,how can I get the page number of a word document?

Thanks
ZJR



发送自 Windows 10 版邮件<https://go.microsoft.com/fwlink/?LinkId=550986>应用



RE: [VOTE] Release Apache Tika 1.17 Candidate #2

2017-12-12 Thread Allison, Timothy B.
Thank you, Luis!

One more vote, and we can release…

From: Luís Filipe Nassif [mailto:lfcnas...@gmail.com]
Sent: Tuesday, December 12, 2017 8:43 AM
To: user@tika.apache.org
Cc: d...@tika.apache.org
Subject: Re: [VOTE] Release Apache Tika 1.17 Candidate #2

All seems ok after integrating in our system and testing with our limited 
regression corpus.

Luis

2017-12-11 13:13 GMT-02:00 Luís Filipe Nassif <lfcnas...@gmail.com>:
Built on Windows 10 Pro with jdk 1.8.0_152 x64, all tests passed. So +1 from me.

PS: Running regression test on our 1M forensic test corpus...

Luis

2017-12-08 22:43 GMT-02:00 Tim Allison <talli...@apache.org>:


On Friday, December 8, 2017, 7:43:05 PM EST, Tim Allison <tallison_apa...@yahoo.com> wrote:


A candidate for the Tika 1.17 release is available at:
  https://dist.apache.org/repos/dist/dev/tika/

The release candidate is a zip archive of the sources in:
  https://github.com/apache/tika/tree/1.17-rc2/

The SHA1 checksum of the archive is
  c6a267956e82365c3a2b456819205763921f2f9d.

In addition, a staged maven repository is available here:
  
https://repository.apache.org/content/repositories/orgapachetika-1028/org/apache/tika/

Please vote on releasing this package as Apache Tika 1.17.
The vote is open for the next 72 hours and passes if a majority of at
least three +1 Tika PMC votes are cast.

[ ] +1 Release this package as Apache Tika 1.17
[ ] -1 Do not release this package because...

+1 for me




RE: Parse file without creating tmp file

2018-01-11 Thread Allison, Timothy B.
I'm not aware of such a list.  Part of the challenge is that we don't know when 
our dependencies might choose to create a temp file.

Sorry!

-Original Message-
From: Van Tassell, Kristian [mailto:kristian.vantass...@siemens.com] 
Sent: Thursday, January 11, 2018 1:42 PM
To: user@tika.apache.org
Subject: RE: Parse file without creating tmp file

Apologies for bumping such an old thread, but is there an official list 
somewhere of those filetypes that require the temporary file being created?

Thanks!

-Original Message-
From: Nick Burch [mailto:apa...@gagravarr.org]
Sent: Tuesday, July 11, 2017 4:23 AM
To: user@tika.apache.org
Subject: Re: Parse file without creating tmp file

On Tue, 11 Jul 2017, aravinth thangasami wrote:
> Recently I have noticed tika creates a tmp file in before parsing the 
> stream.

Only for certain formats, generally where the underlying parsing library 
requires a file for random-access

> I don't have much experience in Tika but I feel it is an overhead.
> Can we achieve file parsing without writing to tmp file?

For some files, no, not without re-writing other open source libraries

For most, it isn't needed and Tika won't do it

Nick



RE: How to implement an InputStream that dynamically guesses the extension of a file that is streamed using Apache Tika?

2018-01-11 Thread Allison, Timothy B.
Hi Martin,
I’m sorry for my delay.  As a first pass at an answer…We have roughly three 
mechanisms for file id:


  1.  mime patterns (magic mime)
  2.  package detection
  3.  parse-time sub-type detection
  4.  file name extension (completely useless for your purposes)


  1.  You should be able to use the mime patterns in a buffered single read.  
Buffer the first 1024 bytes or so and run our mime detection.
  2.  We are currently opening the zip/package file and looking for particular 
files within the zip/package files e.g. docx, xlsx…etc, which requires the 
whole file and cannot be done by our current methods in a streaming fashion.  I 
don’t see a way around parsing the package/container file
  3.  IIRC, some of our parsers update the mime based on knowledge of that 
particular format’s subtypes/actually parsing the file (doc, ppt and …?) …so 
these would be a non-starter.

Regrettably, AFAIK, at least from a Tika perspective, there is no silver bullet.

Instead of having to spool the complete file to memory (or disk) and then run 
detection (or having Tika do that) for every file, I wonder if you could run 1) 
(mime magic detection) on the stream, and, if that returns something obvious, 
go with that, otherwise spool to disk and then run regular Tika on that subset 
of files.
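
A rough sketch of option 1), running Tika's default mime magic over only a buffered
head of the stream (buffer size and names here are arbitrary; per 2) above, zip-based
containers will only ever come back as something generic like application/zip):

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.config.TikaConfig;
import org.apache.tika.detect.Detector;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;

public class HeadOnlyDetection {

    // 64k is arbitrary, but comfortably covers Tika's magic patterns.
    private static final int HEAD_SIZE = 64 * 1024;

    /**
     * Detect a MediaType from only the first HEAD_SIZE bytes of the stream.
     * The caller keeps streaming the remainder (e.g. into MessageDigests) and is
     * responsible for replaying/prepending the bytes consumed here.
     */
    public static MediaType detectFromHead(InputStream in) throws IOException {
        byte[] head = new byte[HEAD_SIZE];
        int total = 0;
        int n;
        while (total < head.length && (n = in.read(head, total, head.length - total)) != -1) {
            total += n;
        }
        Detector detector = TikaConfig.getDefaultConfig().getDetector();
        // ByteArrayInputStream supports mark/reset, which the detectors rely on.
        return detector.detect(new ByteArrayInputStream(head, 0, total), new Metadata());
    }
}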

Nick Burch will probably have better insight on this than my ramblings above.

From: Martin Todorov [mailto:carlspr...@gmail.com]
Sent: Thursday, January 4, 2018 8:48 PM
To: user@tika.apache.org
Subject: How to implement an InputStream that dynamically guesses the extension 
of a file that is streamed using Apache Tika?


Hi,

I have asked this on Stackoverflow and was pointed here, with the hope that more 
people would be able to help.

We have a custom implementation of an InputStream that can currently update 
multiple MessageDigest-s while reading the data. This allows for a single 
reading and processing of the data and avoids having to re-read files in order 
to calculate their checksums. This is quite efficient and saves time (and is 
implemented here).

As a follow-up step, we'd like to use Apache Tika to guess the file extension 
from the stream, which is sent over HTTP. I know some of you will suggest 
simply setting the Content-Type header and requiring that it's set, but, 
unfortunately, for various reasons, we cannot rely on this, or enforce it. 
Hence, I'm looking for a way to guess the extension based on the InputStream, 
while it's being sent.

We also need to be able to guess complex extension types (such as tar.gz, 
tar.bz2 and other similar ones that aren't easy to guess by just doing a 
substring from the last index of the dot until the end of the string).

What is the most-efficient way to do this? We cannot afford to read the whole 
files in memory, as the application will have to be able to handle a large 
number of concurrent requests. Could somebody please provide an example, of how 
this could be done?

We have an open issue and a pull request here, if anyone would like to have a 
closer look and help out.

Looking forward to your suggestions and replies!
Kind regards,

Martin Todorov





RE: Tika-parsers using cat-x json.org dep and is geoapis ok?

2018-01-23 Thread Allison, Timothy B.
Fixed via TIKA-2535 in both 1.18 and 2.0.  Thank you, Joe and Chris!

-Original Message-
From: Joe Witt [mailto:joe.w...@gmail.com] 
Sent: Tuesday, January 23, 2018 10:46 AM
To: user@tika.apache.org
Subject: Re: Tika-parsers using cat-x json.org dep and is geoapis ok?

Here is the legal JIRA to ask about the categorization of that license
https://issues.apache.org/jira/browse/LEGAL-360

Thanks
Joe

On Tue, Jan 23, 2018 at 10:04 AM, Joe Witt  wrote:
> i said yep to be agreeable then thought...
>
> the problem dependency isnt coming in transitively via apache sis.  it 
> is a dependency tika parsers pulls in itself via geoapis.
>
> https://github.com/apache/tika/blob/master/tika-parsers/pom.xml
>
> ill raise the license question on legal but ill avoid bugging the sis 
> folks just yet.
>
> thanks
> joe
>
> On Jan 23, 2018 9:58 AM, "Joe Witt"  wrote:
>>
>> yep.  sounds fair
>>
>> On Jan 23, 2018 9:52 AM, "Chris Mattmann"  wrote:
>>>
>>> Hi Joe,
>>>
>>> Great analysis.
>>>
>>> Can you do me a favor:
>>>
>>> 1. Raise a LEGAL JIRA with the below insight.
>>> 2. Contact the Apache SIS PMC and ask them how they dealt with it? 
>>> SIS is an ASF project and is expected to be following ASF release 
>>> guidelines which gives me confidence in the product (and its 
>>> dependencies that they ship). Martin Desruisseaux is an ASF member 
>>> and their Chair and is very thorough I'm sure they ran into this and 
>>> have some idea.
>>>
>>> Tika should action (or not) based on #1 and 2 above. Sound good?
>>>
>>> Cheers,
>>> Chris Mattmann
>>> (wearing his VP, Legal hat).
>>>
>>>
>>> On 1/23/18, 5:53 AM, "Joe Witt"  wrote:
>>>
>>> Chris
>>>
>>> Bottom line up front: Is
>>> https://github.com/unitsofmeasurement/jsr-275/blob/0.9.3/LICENSE.txt
>>> Category A or Category B?
>>>
>>>
>>> ** a bunch of words to explain why I'm asking 
>>>
>>> I truly do not wish to create a problem where there is none.  L&N is
>>> truly a painful thing.  That said, based on my experience and current
>>> understanding of ASF policies and guidance I do believe there is a
>>> problem.
>>>
>>> If you think this thread is better on legal-discuss please let me
>>> know.  My hope in starting the thread here was to get a 'yep 
>>> this is a
>>> known thing - we cleared it with legal - here is a mailing list 
>>> thread
>>> or JIRA or something'.
>>>
>>> What I believe to be true is that there are binary artifacts 
>>> which are
>>> under licenses.  Those licenses are either compatible with the ASF
>>> legal policy or they are not and specifically they're either 
>>> listed as
>>> Category-A or Category-B from
>>> https://www.apache.org/legal/resolved.html.  If they're not you 
>>> cannot
>>> use them as binary dependencies until they are on that list.
>>>
>>> What is also true is that apache-tika-parsers version 1.16 (at least)
>>> depends on org/opengis/geoapi 3.0.0 which depends on
>>> javax.measure.jsr-275:0.9.3.  That artifact appears to be under this
>>> license:
>>> https://github.com/unitsofmeasurement/jsr-275/blob/0.9.3/LICENSE.txt.
>>>
>>> Plainly, from my quick read and review that binary artifact
>>> (jsr-275:0.9.3) does not appear to be a Category A or Category B
>>> license.  Do you believe it is?  If yes which Cat-A/Cat-B is it
>>> considered to be?  Is there a mailing list thread, Legal-Discuss, or
>>> L&N entry in Tika that calls this out so I can reference that?
>>>
>>> Now for more general background:
>>> There are all kinds of threads on the Internet about the 
>>> problems with
>>> JSR-275 and that JSR-363 is the way to go to move on with regard to
>>> the unit of measure work, etc..
>>>
>>> If you look at the source for opengis/geoapi which I believe is here
>>> https://github.com/opengeospatial/geoapi/tree/3.0.0 which is what
>>> tika-parsers uses then it will pull in the jsr-275:jar:0.9.3.
>>>
>>> If you look at the source for opengis/geoapi for latest milestone
>>> release https://github.com/opengeospatial/geoapi/tree/4.0-M06 you can
>>> see they've moved on from JSR-275 and now use JSR-363.
>>>
>>> Further, the Apache SIS project in their Nov 2017 release 0.8
>>> (Tika-parsers 1.16 uses apache sis 0.6) clearly stated in their 
>>> NOTICE
>>> they depend on JSR-363.  Not sure if they were specifically 
>>> relying on
>>> JSR-275 before that or not as it isn't called out.
>>>
>>> Thanks
>>> Joe
>>>
>>> On Mon, Jan 22, 2018 at 11:43 AM, Chris Mattmann 
>>>  wrote:
>>> > Hi Joe,
>>> >
>>> >
>>> >
>>> > My quick read on the license is that it’s a spec jar in a 
>>> transitive
>>> > dependency. SIS has made
>>> > many releases and is an ASF project (of which this JSR 275 is 
>>> one dependency
>>> > that I believe is
>>> > just the JSR spec API). I think you’re fine to use Tika and to 
>>> use SIS in
>>>

RE: Long time with OCR

2018-02-20 Thread Allison, Timothy B.
> These pages are hard because they have different fonts and maybe other 
> complications.

+1 … As a side note, a colleague and I did an image degradation study, and we 
noticed that tesseract took far longer on the degraded images than on the 
originals.  Your intuition is correct.  This won’t help improve your speed, but 
I thought I’d share.

From: Chris Mattmann [mailto:mattm...@apache.org]
Sent: Tuesday, February 20, 2018 12:31 PM
To: user@tika.apache.org
Subject: Re: Long time with OCR

Updated the wiki page with this info, thanks Nick!



From: Mark Kerzner <mark.kerz...@shmsoft.com>
Reply-To: "user@tika.apache.org" <user@tika.apache.org>
Date: Tuesday, February 20, 2018 at 6:36 AM
To: Tika User <user@tika.apache.org>
Subject: Re: Long time with OCR

Hi, Nick,

Thank you very much.

Mark

Mark Kerzner, SHMsoft,
Book a call with me here

Mobile: 713-724-2534
Skype: mark.kerzner1

On Tue, Feb 20, 2018 at 6:59 AM, Nick Burch <apa...@gagravarr.org> wrote:
On Mon, 19 Feb 2018, Mark Kerzner wrote:
Is that a good approach? Is the 10 seconds time normal? I am using the latest 
most powerful Mac and I get similar results on an i7 processor in Ubuntu.

Tika uses the open source Tesseract OCR engine. Tesseract is optimised for ease 
of contributions and ease of implementing new approaches, rather than for 
performance, because as an (ex?-) academic project that's more what they 
think is important

There's some advice on the Tesseract github issues + wiki on ways to speed it 
up, eg https://github.com/tesseract-ocr/tesseract/issues/263 and
https://github.com/tesseract-ocr/tesseract/issues/1171 and
https://github.com/tesseract-ocr/tesseract/wiki/4.0-Accuracy-and-Performance

Otherwise you'd need to switch to a proprietary OCR tool. I understand that the 
Google Cloud OCR is pretty good, if you don't mind pushing all your files up to 
Gooogle and paying per file

Nick



RE: Malware RTF is not detected as RTF

2018-03-01 Thread Allison, Timothy B.
Yes.  Please do open a ticket, and yes, I have a need to read anything from 
decalage…he does some amazing work. 😊

I trust you wouldn’t, but please don’t post an actual malware file for us to 
use in our unit tests. 😉

From: Jim Idle [mailto:ji...@proofpoint.com]
Sent: Thursday, March 1, 2018 12:32 AM
To: user@tika.apache.org
Subject: Malware RTF is not detected as RTF

I can open a ticket for this but wanted to just run it by you first.

As explained here: http://www.decalage.info/rtf_tricks (no need to read if you 
don’t care 😉)

Malicious RTF files take advantage of the fact that Microsoft do not follow 
their own RTF spec. Specifically, Word et al. only look for the opening 
sequence:

{\rt

Though the spec says it should be:

{\rtf1

Where 1 is the version number.

Tika fails to identify a malware file starting:

{\rtf1{\pict\jpegblip\picw24\pich24\bin49922

As an RTF file – it says that it is application/octet-stream

Could the Tika detector be modified to just look for {\rt as per the Office tools?

Cheers,

Jim
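
Until/unless the built-in magic (defined in tika-mimetypes.xml) is relaxed, one
workaround is an application-side Detector that mimics Word's lenient check.  This
is purely an illustrative sketch (class name made up), not a Tika patch:

import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

import org.apache.tika.detect.Detector;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;

public class LenientRtfDetector implements Detector {

    // Word effectively only requires "{\rt" at the start of the file.
    private static final byte[] MAGIC = "{\\rt".getBytes(StandardCharsets.US_ASCII);

    @Override
    public MediaType detect(InputStream input, Metadata metadata) throws IOException {
        if (input == null) {
            return MediaType.OCTET_STREAM;
        }
        input.mark(MAGIC.length);   // detectors are handed mark-supporting streams
        try {
            byte[] head = new byte[MAGIC.length];
            int total = 0;
            int n;
            while (total < head.length && (n = input.read(head, total, head.length - total)) != -1) {
                total += n;
            }
            return (total == MAGIC.length && Arrays.equals(head, MAGIC))
                    ? MediaType.application("rtf") : MediaType.OCTET_STREAM;
        } finally {
            input.reset();
        }
    }
}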


RE: XBRL documents.

2018-03-14 Thread Allison, Timothy B.
Tika's default handling of xml is to scrape out the text and ignore the 
entities and attributes, IIRC.  So, if that's the behavior you want, and your 
XBRLs are well-formed XML, you'll be good to go.

If they're non-standard XML or if you want the node names and attributes, you 
may have to add your own parser, which should be straightforward[1].

The best way to see what Tika will do is to download tika-app[2], start up the 
GUI and drop in a file to see what you get.

[1] https://tika.apache.org/1.17/parser_guide.html
[2] http://www.apache.org/dyn/closer.cgi/tika/tika-app-1.17.jar
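
Programmatically, the same quick check is a one-liner with the Tika facade (file 
name assumed):

import java.io.File;

import org.apache.tika.Tika;

public class XbrlTextDump {
    public static void main(String[] args) throws Exception {
        // For well-formed XML/XBRL this returns the scraped character content,
        // with element names and attributes ignored (the default XML handling).
        String text = new Tika().parseToString(new File("sample.xbrl"));
        System.out.println(text);
    }
}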

From: Johnson, Jaya [mailto:jaya.john...@moodys.com]
Sent: Tuesday, March 13, 2018 5:06 PM
To: user@tika.apache.org
Subject: XBRL documents.

Can Tika parse XBRL documents? It's a variation of an XML document.

Thanks.


RE: Subfile Extraction

2018-03-27 Thread Allison, Timothy B.
+1 to Nick's links and advice.

To use the RecursiveParserWrapper with tika-app, use the -J option; or if 
you're using tika-server, use the /rmeta endpoint.
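
In code, the equivalent with the 1.x API looks roughly like this (container file
name assumed):

import java.io.FileInputStream;
import java.io.InputStream;
import java.util.List;

import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.RecursiveParserWrapper;
import org.apache.tika.sax.BasicContentHandlerFactory;
import org.xml.sax.helpers.DefaultHandler;

public class EmbeddedDocDump {
    public static void main(String[] args) throws Exception {
        RecursiveParserWrapper wrapper = new RecursiveParserWrapper(
                new AutoDetectParser(),
                new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1));
        Metadata metadata = new Metadata();
        try (InputStream is = TikaInputStream.get(new FileInputStream("container.docx"))) {
            wrapper.parse(is, new DefaultHandler(), metadata, new ParseContext());
        }
        // One Metadata object per document: the container first, then each embedded file;
        // the extracted text rides along under the RecursiveParserWrapper.TIKA_CONTENT key.
        List<Metadata> metadataList = wrapper.getMetadata();
        for (Metadata m : metadataList) {
            System.out.println(m.get(Metadata.RESOURCE_NAME_KEY) + " -> "
                    + m.get(RecursiveParserWrapper.TIKA_CONTENT));
        }
    }
}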

The ecology of embedded docs is rich and understudied (IMHO), let us know what 
you find!

Cheers,

  Tim

-Original Message-
From: McGreevy, Anthony [mailto:anthony.mcgre...@microfocus.com] 
Sent: Tuesday, March 27, 2018 11:47 AM
To: user@tika.apache.org
Subject: RE: Subfile Extraction

Thanks for the information!

Much appreciated!

Anthony

-Original Message-
From: Nick Burch [mailto:apa...@gagravarr.org]
Sent: 27 March 2018 15:50
To: user@tika.apache.org
Subject: Re: Subfile Extraction

On Sun, 25 Mar 2018, McGreevy, Anthony wrote:
> I am currently playing with Tika to see how it works with regards to 
> extraction of subfiles.

Do you mean files or resources embedded within another file?

If so... With the Tika App, you want -z to have these extracted. With the Tika 
java classes, you want to pop something like a 
https://tika.apache.org/1.17/api/org/apache/tika/parser/RecursiveParserWrapper.html
or a
https://tika.apache.org/1.17/api/org/apache/tika/extractor/ContainerExtractor.html
on your ParseContext to get called for embedded resources. See 
https://wiki.apache.org/tika/RecursiveMetadata for more on how it works and how 
to have Tika parse + return all the embedded files and resources

Nick



RE: Tika Server: Disable OCR / Tesseract by HTTP parameter?

2018-04-11 Thread Allison, Timothy B.
Others may be more familiar with tika-server and OCR, but I notice that we do 
process X-Tika-OCR prefixed headers to configure TesseractOCRConfig .  If you 
set "tesseractPath" to something bogus, that may turn off OCR...give something 
like this a try:

--header "X-Tika-OCRTesseractPath:/bogosity"

-Original Message-
From: Markus Mandalka [mailto:t...@mandalka.name] 
Sent: Monday, April 9, 2018 11:00 AM
To: user@tika.apache.org
Subject: Tika Server: Disable OCR / Tesseract by HTTP parameter?

Hello,


is it possible to optionally disable OCR (by Tesseract) in Tika Server by an HTTP 
/ HEADER / POST / GET / REST-API parameter (i.e. for some but not all 
documents) instead of disabling it globally in the tika.xml config or uninstalling 
Tesseract as in the documentation https://wiki.apache.org/tika/TikaOCR ?


Best regards,

Markus




FW: Default Tika extraction of docx 5X slower than XWPFWordExtractor?

2012-01-20 Thread Allison, Timothy B.
  I'm just getting started with Tika, and I tried the basic AutoDetectParser 
and the basic ParsingReader on a batch of a few thousand docx files (tika-app 
v1.0).  On my laptop, I was able to extract text at a rate of 200 docs per 
minute.  When I ran XWPFWordExtractor (poi 3.8) on the same docs, the rate was 
1000 docs per minute.  Is there a faster way to use Tika to extract text from a 
file?  Is this performance difference expected and/or experienced by others?

 Thank you.
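
For reference, the comparison being described is roughly the following (sketch only;
file name assumed, and timings will obviously vary by corpus and hardware):

import java.io.File;
import java.io.FileInputStream;

import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.tika.Tika;

public class DocxSpeedCheck {
    public static void main(String[] args) throws Exception {
        File f = new File("sample.docx");

        long t0 = System.nanoTime();
        String viaTika = new Tika().parseToString(f);       // auto-detect + full parse
        long t1 = System.nanoTime();

        XWPFWordExtractor extractor =
                new XWPFWordExtractor(new XWPFDocument(new FileInputStream(f)));
        String viaPoi = extractor.getText();                // POI extractor directly
        long t2 = System.nanoTime();

        System.out.printf("tika: %d ms (%d chars), poi: %d ms (%d chars)%n",
                (t1 - t0) / 1_000_000, viaTika.length(),
                (t2 - t1) / 1_000_000, viaPoi.length());
    }
}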



pdf acroform and tika

2012-02-23 Thread Allison, Timothy B.
Not sure if this is an issue for PDFBox or Tika, but I noticed that PDFBox's 
textstripper is not extracting information from the form fields in a batch of 
pdf documents I'm processing.  Is anyone else having this problem?
I regret that I'm unable to send an example document.
Inelegant solution with error handling not included:
// Imports assumed (PDFBox 1.x): org.apache.pdfbox.pdmodel.PDDocumentCatalog,
// org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm,
// org.apache.pdfbox.pdmodel.interactive.form.PDField, java.util.List
StringBuilder sb = new StringBuilder();
// get text with the text stripper first, then append any AcroForm field values
PDDocumentCatalog catalog = pdDoc.getDocumentCatalog();
if (catalog != null) {
    PDAcroForm form = catalog.getAcroForm();
    if (form != null) {
        List<PDField> fields = form.getFields();
        for (PDField field : fields) {
            sb.append(field.getFullyQualifiedName() + ": " + field.getValue() + "\r\n");
        }
    }
}



BodyContentHandler and a docx embedded within a PDF

2013-05-22 Thread Allison, Timothy B.
I have a PDF document with a docx attachment.  I wasn't having luck getting the 
contents of the docx with tika.parseToString(file).

I dug around a bit in the PDFExtractor and found that when I changed this line:
embeddedExtractor.parseEmbedded(
        stream,
        new EmbeddedContentHandler(new BodyContentHandler(localHandler)),
        metadata,
        false);

to:

embeddedExtractor.parseEmbedded(
        stream,
        new EmbeddedContentHandler(handler),
        metadata,
        false);

in other words, when I no longer required "body" elements, I was able to get 
the content of the attached document.

I attached the same inner document to a docx file and had luck without this 
change.   Does anyone know why this change is required in PDFExtractor?  Is 
this a bad solution?

Unfortunately, I can't share the documents.

   Best,

   Tim



RE: Html Parser autodetect charset

2013-06-21 Thread Allison, Timothy B.
In the tika-app.jar, go to META-INF/services; there's a file that specifies the 
order of application of the encoding detectors 
(org.apache.tika.detect.EncodingDetector).  The AutoDetectReader applies these 
in order and stops as soon as one of the detectors thinks that it detects an 
encoding.

If you flip the order so that icu4j is first (as below), you should be set.

org.apache.tika.parser.txt.Icu4jEncodingDetector
org.apache.tika.parser.html.HtmlEncodingDetector
org.apache.tika.parser.txt.UniversalEncodingDetector

You could also create your own dummy EncodingDetector (always returns "UTF-8") 
and register it in the service file.
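
A dummy detector is only a few lines, e.g. (sketch; class name made up, registered by
listing it in META-INF/services/org.apache.tika.detect.EncodingDetector):

import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

import org.apache.tika.detect.EncodingDetector;
import org.apache.tika.metadata.Metadata;

public class Utf8OnlyEncodingDetector implements EncodingDetector {
    @Override
    public Charset detect(InputStream input, Metadata metadata) throws IOException {
        // Ignore the stream entirely; the caller has already transcoded to UTF-8.
        return StandardCharsets.UTF_8;
    }
}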

From: Dave French [mailto:dfre...@jsitelecom.com]
Sent: Thursday, June 20, 2013 11:33 AM
To: user@tika.apache.org
Subject: Html Parser autodetect charset

Hey,

In my use case of tika, I am rendering a webpage, taking the contents of the 
page and feeding this into tika.  The contents of the webpage are encoded in 
UTF-8 when I feed it into tika, but the HtmlParser is using the 
AutoDetectReader to try and determine the charset.  This means tika is using 
the meta-data tag of the page to determine the charset.

Is there a way to not use this AutoDetectReader and just specify the charset?  
Or better yet, inject the Detector that will be used?

Thanks for your help,
Dave




RE: How to extract autoshape text in Excel 2007+

2013-07-22 Thread Allison, Timothy B.
This looks like an area for a  new feature in both Tika and POI.  I've only 
looked very briefly into the POI libraries, and I may have missed how to 
extract text from autoshapes.  I'll open an issue in both projects.

-Original Message-
From: Hiroshi Tatsumi [mailto:honekich...@comet.ocn.ne.jp] 
Sent: Sunday, July 21, 2013 10:16 AM
To: user@tika.apache.org
Subject: How to extract autoshape text in Excel 2007+

Hi,

I am using Tika 1.3 and Solr 4.3.1.
I'd like to extract autoshape text in Excel 2007+(.xlsx), but I can't.

I tried to extract from some MS office files.
The results are below.

Success (I can extract autoshape text.)
- Excel 2003(.xls)
- Word 2003(.doc)
- Word 2007+(.docx)

Failed (I cannot extract autoshape text.)
- Excel 2007+(.xlsx)

Is this a bug?
If you know, could you tell me how to extract autoshape text in Excel 2007+?

Thanks,
Hiro. 



RE: How to extract autoshape text in Excel 2007+

2013-07-22 Thread Allison, Timothy B.
This is one way to access the underlying CTShape that contains the text:

XSSFWorkbook wb = new XSSFWorkbook(new FileInputStream(f));
XSSFSheet sheet = wb.getSheetAt(0);
XSSFDrawing drawing = sheet.createDrawingPatriarch();
for (XSSFShape shape : drawing.getShapes()){
   if (shape instanceof XSSFSimpleShape){
  XSSFSimpleShape simple = ((XSSFSimpleShape)shape);
  System.out.println("CT: "+simple.getCTShape());
   }
}

Hiroshi, If this is a high priority, you could extract the txBody element with 
some bean work.  I've opened https://issues.apache.org/jira/browse/TIKA-1150 
for the longer term fix.

There's some work going on on XSSFTextCell in POI that might make this more 
straightforward.

-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org] 
Sent: Monday, July 22, 2013 8:50 AM
To: user@tika.apache.org
Subject: RE: How to extract autoshape text in Excel 2007+

This looks like an area for a  new feature in both Tika and POI.  I've only 
looked very briefly into the POI libraries, and I may have missed how to 
extract text from autoshapes.  I'll open an issue in both projects.

-Original Message-
From: Hiroshi Tatsumi [mailto:honekich...@comet.ocn.ne.jp] 
Sent: Sunday, July 21, 2013 10:16 AM
To: user@tika.apache.org
Subject: How to extract autoshape text in Excel 2007+

Hi,

I am using Tika 1.3 and Solr 4.3.1.
I'd like to extract autoshape text in Excel 2007+(.xlsx), but I can't.

I tried to extract from some MS office files.
The results are below.

Success (I can extract autoshape text.)
- Excel 2003(.xls)
- Word 2003(.doc)
- Word 2007+(.docx)

Failed (I cannot extract autoshape text.)
- Excel 2007+(.xlsx)

Is this a bug?
If you know, could you tell me how to extract autoshape text in Excel 2007+?

Thanks,
Hiro. 



RE: How to extract autoshape text in Excel 2007+

2013-07-22 Thread Allison, Timothy B.
Hiroshi,
   To fix this on your own will take quite a bit of work.  I give details below 
if you do want to go this route.

The longer term path, I think is:
1) https://issues.apache.org/bugzilla/show_bug.cgi?id=55292 will be committed 
to POI.
2) A new release of POI will be made.
3) Small fixes to Tika's Excel parser will be made to take advantage of the new 
functionality in POI55292.

Others on the list may have a simpler solution, but this is what I had to do 
before https://issues.apache.org/jira/browse/TIKA-1130 was committed.

This is a very unappetizing solution; beware of dragons and don't try this at 
work.  Your steps will differ somewhat because you're working with xlsx vs 
docx.  I'm sure that I don't remember each step.

1) Modify the underlying POI code to expose a getText() or similar 
functionality on the object of interest to me (in my original email, I gave 
some hint of how to do this)

2) Modify XWPFWordExtractorDecorator to take advantage of getText() in the 
underlying POI object.

There are several options for how to tie it all together.
3)  I chose to copy and paste into a different namespace 
XWPFWordExtractorDecorator and the following classes:
OOXMLExtractorFactory
4) Modify the above to call your new version of XWPFWordExtractorDecorator
5) Finally, register your new office parser in 
tika-parsers/META-INF/org.apache.tika.parser.Parser



-Original Message-
From: Hiroshi Tatsumi [mailto:honekich...@comet.ocn.ne.jp] 
Sent: Monday, July 22, 2013 11:42 AM
To: user@tika.apache.org
Subject: Re: How to extract autoshape text in Excel 2007+

Thank you for your reply. I really appreciate it.
This is a high priority for me.
Because we use solr, and our customer wants to search autoshapes' text in 
Excel 2007+ files.

I've been investigating the Tika source code, and trying to fix it.
I understand that I can extract text from autoshapes with XSSFWorkbook.
But first, I think the problem is ExcelExtractor's listenForAllRecords 
field.
I set "true" to listenForAllRecords field like below, but it didn't work 
properly.
-
Class: org.apache.tika.parser.microsoft.OfficeParser
Method: protected void parse(
DirectoryNode root, ParseContext context, Metadata metadata, 
XHTMLContentHandler xhtml)
TargetCode:
case XLR:
   Locale locale = context.get(Locale.class, Locale.getDefault());
   ExcelExtractor ee = new ExcelExtractor(context);
   ee.setListenForAllRecords(true);
   ee.parse(root, xhtml, locale);
   // original code
   // new ExcelExtractor(context).parse(root, xhtml, locale);
   break;
-

Is this a wrong direction?
If you know which class I should fix, please let me know.



-Original Message- 
From: Allison, Timothy B.
Sent: Monday, July 22, 2013 10:27 PM
To: user@tika.apache.org
Subject: RE: How to extract autoshape text in Excel 2007+

This is one way to access the underlying CTShape that contains the text:

XSSFWorkbook wb = new XSSFWorkbook(new FileInputStream(f));
XSSFSheet sheet = wb.getSheetAt(0);
XSSFDrawing drawing = sheet.createDrawingPatriarch();
for (XSSFShape shape : drawing.getShapes()){
   if (shape instanceof XSSFSimpleShape){
  XSSFSimpleShape simple = ((XSSFSimpleShape)shape);
  System.out.println("CT: "+simple.getCTShape());
   }
}

Hiroshi, If this is a high priority, you could extract the txBody element 
with some bean work.  I've opened 
https://issues.apache.org/jira/browse/TIKA-1150 for the longer term fix.

There's some work going on on XSSFTextCell in POI that might make this more 
straightforward.

-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Monday, July 22, 2013 8:50 AM
To: user@tika.apache.org
Subject: RE: How to extract autoshape text in Excel 2007+

This looks like an area for a  new feature in both Tika and POI.  I've only 
looked very briefly into the POI libraries, and I may have missed how to 
extract text from autoshapes.  I'll open an issue in both projects.

-Original Message-
From: Hiroshi Tatsumi [mailto:honekich...@comet.ocn.ne.jp]
Sent: Sunday, July 21, 2013 10:16 AM
To: user@tika.apache.org
Subject: How to extract autoshape text in Excel 2007+

Hi,

I am using Tika 1.3 and Solr 4.3.1.
I'd like to extract autoshape text in Excel 2007+(.xlsx), but I can't.

I tried to extract from some MS office files.
The results are below.

Success (I can extract autoshape text.)
- Excel 2003(.xls)
- Word 2003(.doc)
- Word 2007+(.docx)

Failed (I cannot extract autoshape text.)
- Excel 2007+(.xlsx)

Is this a bug?
If you know, could you tell me how to extract autoshape text in Excel 2007+?

Thanks,
Hiro. 



RE: How to extract autoshape text in Excel 2007+

2013-09-26 Thread Allison, Timothy B.
Fixed now.  Build from current trunk (r1526498) or pull from 
https://builds.apache.org/job/Tika-trunk/lastStableBuild/ after Jenkins has had 
a chance to build.

Best,

   Tim

-Original Message-
From: Hiroshi Tatsumi [mailto:honekich...@comet.ocn.ne.jp] 
Sent: Wednesday, September 25, 2013 6:30 PM
To: user@tika.apache.org
Subject: Re: How to extract autoshape text in Excel 2007+

Hi,

I'm waiting for the fix of this bug.
https://issues.apache.org/jira/browse/TIKA-1100

The POI's bug which is referenced in this issue has fixed already.
http://issues.apache.org/bugzilla/show_bug.cgi?id=55292

It would be great if you could give me a patch.


Thanks,
Hiroshi Tatsumi



-Original Message----- 
From: Allison, Timothy B.
Sent: Tuesday, July 23, 2013 5:10 AM
To: user@tika.apache.org
Subject: RE: How to extract autoshape text in Excel 2007+

Hiroshi,
   To fix this on your own will take quite a bit of work.  I give details 
below if you do want to go this route.

The longer term path, I think is:
1) https://issues.apache.org/bugzilla/show_bug.cgi?id=55292 will be 
committed to POI.
2) A new release of POI will be made.
3) Small fixes to Tika's Excel parser will be made to take advantage of the 
new functionality in POI55292.

Others on the list may have a simpler solution, but this is what I had to do 
before https://issues.apache.org/jira/browse/TIKA-1130 was committed.

This is a very unappetizing solution; beware of dragons and don't try this 
at work.  Your steps will differ somewhat because you're working with xlsx 
vs docx.  I'm sure that I don't remember each step.

1) Modify the underlying POI code to expose a getText() or similar 
functionality on the object of interest to me (in my original email, I gave 
some hint of how to do this)

2) Modify XWPFWordExtractorDecorator to take advantage of getText() in the 
underlying POI object.

There are several options for how to tie it all together.
3)  I chose to copy and paste into a different namespace 
XWPFWordExtractorDecorator and the following classes:
OOXMLExtractorFactory
4) Modify the above to call your new version of XWPFWordExtractorDecorator
5) Finally, register your new office parser in 
tika-parsers/META-INF/org.apache.tika.parser.Parser



-Original Message-
From: Hiroshi Tatsumi [mailto:honekich...@comet.ocn.ne.jp]
Sent: Monday, July 22, 2013 11:42 AM
To: user@tika.apache.org
Subject: Re: How to extract autoshape text in Excel 2007+

Thank you for your reply. I really appreciate it.
This is a high priority for me.
Because we use solr, and our customer wants to search autoshapes' text in
Excel 2007+ files.

I've been investigating the Tika source code, and trying to fix it.
I understand that I can extract text from autoshapes with XSSFWorkbook.
But first, I think the problem is ExcelExtractor's listenForAllRecords
field.
I set "true" to listenForAllRecords field like below, but it didn't work
properly.
-
Class: org.apache.tika.parser.microsoft.OfficeParser
Method: protected void parse(
DirectoryNode root, ParseContext context, Metadata metadata,
XHTMLContentHandler xhtml)
TargetCode:
case XLR:
   Locale locale = context.get(Locale.class, Locale.getDefault());
   ExcelExtractor ee = new ExcelExtractor(context);
   ee.setListenForAllRecords(true);
   ee.parse(root, xhtml, locale);
   // original code
   // new ExcelExtractor(context).parse(root, xhtml, locale);
   break;
-

Is this a wrong direction?
If you know which class I should fix, please let me know.



-Original Message- 
From: Allison, Timothy B.
Sent: Monday, July 22, 2013 10:27 PM
To: user@tika.apache.org
Subject: RE: How to extract autoshape text in Excel 2007+

This is one way to access the underlying CTShape that contains the text:

XSSFWorkbook wb = new XSSFWorkbook(new FileInputStream(f));
XSSFSheet sheet = wb.getSheetAt(0);
XSSFDrawing drawing = sheet.createDrawingPatriarch();
for (XSSFShape shape : drawing.getShapes()){
   if (shape instanceof XSSFSimpleShape){
  XSSFSimpleShape simple = ((XSSFSimpleShape)shape);
  System.out.println("CT: "+simple.getCTShape());
   }
}

Hiroshi, If this is a high priority, you could extract the txBody element
with some bean work.  I've opened
https://issues.apache.org/jira/browse/TIKA-1150 for the longer term fix.

There's some work going on on XSSFTextCell in POI that might make this more
straightforward.

-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Monday, July 22, 2013 8:50 AM
To: user@tika.apache.org
Subject: RE: How to extract autoshape text in Excel 2007+

This looks like an area for a new feature in both Tika and POI.  I've only
looked very briefly into the POI libraries, and I may have missed how to
extract text from autoshapes.  I'll open an issue in both projects.

tika server jax-rs and recursive file processing

2014-04-30 Thread Allison, Timothy B.
All,
  As always, apologies for the cluelessness the following reveals... I'm 
starting to move from embedded Tika to a server option for greater robustness.  
Is the jax-rs server intended not to handle embedded files recursively?  If so, 
how are users currently handling multiply embedded documents with the jax-rs 
server?  Would it be worthwhile to add another service that uses 
AutoDetectParser as the embedded parser/extractor instead of 
MyEmbeddedDocumentExtractor?

Best,

   Tim

Timothy B. Allison, Ph.D.
Lead Artificial Intelligence Engineer
Group Lead
K83A/Human Language Technology
The MITRE Corporation
7515 Colshire Drive, McLean, VA  22102
703-983-2473 (phone); 703-983-1379 (fax)



RE: Question re installing Tika

2014-06-26 Thread Allison, Timothy B.
My plan is to add a tika-batch package as part of TIKA-1330.  One of the 
primary use cases will be input directory -> output directory.  There will be 
hooks for people to add db -> db, and maybe someone with Hadoop skills would be 
willing to contribute a tika-batch-hadoop package.

That should be ready by the end of this coming week.

But, bat scripting is far simpler.

-Original Message-
From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov] 
Sent: Thursday, June 26, 2014 8:58 AM
To: user@tika.apache.org
Subject: Re: Question re installing Tika

+1000 I'm not the Windows guru, but will try and look it up




-Original Message-
From: Nick Burch 
Reply-To: "user@tika.apache.org" 
Date: Thursday, June 26, 2014 5:55 AM
To: "user@tika.apache.org" 
Subject: Re: Question re installing Tika

>On Thu, 26 Jun 2014, Chris Mattmann wrote:
>> looks like a great example to put on the website too ;)
>
>To be fair to all users, we probably ought to have an example that works
>on windows as well. Any powershell gurus around who care to take a stab
>at 
>the windows equivalent?
>
>Nick
>
>> -Original Message-
>> From: Nick Burch 
>> Reply-To: 
>> Date: Thursday, June 26, 2014 5:23 AM
>> To: "user@tika.apache.org" 
>> Subject: RE: Question re installing Tika
>>
>>> On Thu, 26 Jun 2014, Richard wrote:
 You haven't by chance happen to have programmatically looped through a
 directory full of pdfs and used Tika to extract each of their pdf
 contents into separate text or xml files? If so, what do you recommend
 to do the extraction?
>>>
>>> For a proof of concept, how about something simple like a bash for loop
>>> and the tika app?
>>>
>>> for i in *.pdf; do j=`echo "$i" | sed 's/.pdf//'`; java -jar
>>>tika-app.jar
>>>   --text "$i" > "$j.txt"; java -jar tika-app.jar --xml "$i" > "$j.xml";
>>> done
>>>
>>> Nick
>>
>>
>>



RE: Stack Overflow Question

2014-06-30 Thread Allison, Timothy B.
DefaultHandler is effectively a NullHandler; it doesn't store or do anything.



Try BodyContentHandler or ToXMLHandler or maybe WriteoutHandler.





If you want to write out each embedded file as a binary, try subclassing 
EmbeddedResourceHandler.
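
Applied to the quoted code below, the minimal change looks roughly like this (sketch;
file name assumed, and -1 disables BodyContentHandler's write limit):

import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

public class ZipText {
    public static void main(String[] args) throws Exception {
        Parser parser = new AutoDetectParser();
        ParseContext context = new ParseContext();
        context.set(Parser.class, parser);                        // recurse into zip entries
        BodyContentHandler handler = new BodyContentHandler(-1);  // -1 = no write limit
        Metadata metadata = new Metadata();
        try (InputStream stream = TikaInputStream.get(new FileInputStream("example.zip"))) {
            parser.parse(stream, handler, metadata, context);
        }
        // Text of all entries, concatenated; mapping file names to content needs the
        // RecursiveMetadata approach or an XML handler, as discussed in the follow-ups.
        System.out.println(handler.toString());
    }
}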



QUOTE:


i am using Apache Tika 1.5 for parsing the contents present in a zip file,

here's my sample code

Parser parser = new AutoDetectParser();
ParseContext context = new ParseContext();
context.set(Parser.class, parser);
ContentHandler handler = new DefaultHandler();
Metadata metadata = new Metadata();
InputStream stream = null;
try {
    stream = TikaInputStream.get(new File(zipFilePath));
} catch (FileNotFoundException e) {
    e.printStackTrace();
}
try {
    parser.parse(stream, handler, metadata, context);
    logger.info("Content:\t" + handler.toString());
} catch (IOException e) {
    e.printStackTrace();
} catch (SAXException e) {
    e.printStackTrace();
} catch (TikaException e) {
    e.printStackTrace();
} finally {
    try {
        stream.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

in the logger statement all i see is org.xml.sax.helpers.DefaultHandler@5bd8e367

i am missing something, unable to figure it out, looking for some help




-Original Message-

From: yeshwanth kumar [mailto:yeshwant...@gmail.com]

Sent: Monday, June 30, 2014 1:28 PM

To: d...@tika.apache.org

Subject: Stack Overflow Question



Unable tp read zipfile using Apache Tika

http://stackoverflow.com/q/24495504/1899893?sem=2


RE: Stack Overflow Question

2014-06-30 Thread Allison, Timothy B.
Might want to look into RecursiveMetadata Parser
http://wiki.apache.org/tika/RecursiveMetadata

Or

https://issues.apache.org/jira/i#browse/TIKA-1329?issueKey=TIKA-1329&serverRenderedViewIssue=true
From: yeshwanth kumar [mailto:yeshwant...@gmail.com]
Sent: Monday, June 30, 2014 3:24 PM
To: Allison, Timothy B.
Subject: Re: Stack Overflow Question

hi tim,

thanks for quick reply,

i changed the contenthandler to bodyContentHandler i got exception for maximum 
word limit,
i used -1 in the bodycontenthandler constructor,

now its another problem, filenames and content are present in string returned 
from handler.tostring()

how can i map a fileName to its content.

thanks,
yeshwanth

On Tue, Jul 1, 2014 at 12:35 AM, Allison, Timothy B. 
mailto:talli...@mitre.org>> wrote:

DefaultHandler is effectively a NullHandler; it doesn't store or do anything.



Try BodyContentHandler or ToXMLHandler or maybe WriteoutHandler.





If you want to write out each embedded file as a binary, try subclassing 
EmbeddedResourceHandler.



QUOTE:


i am using Apache Tika 1.5 for parsing the contents present in a zip file,

here's my sample code

Parser parser = new AutoDetectParser();

ParseContext context = new ParseContext();

context.set(Parser.class, parser);

ContentHandler handler = new DefaultHandler();

Metadata metadata = new Metadata();

InputStream stream = null;

try {

stream = TikaInputStream.get(new File(zipFilePath));

} catch (FileNotFoundException e) {

e.printStackTrace();

}

try {



parser.parse(stream, handler, metadata, context);



logger.info("Content:\t" + handler.toString());

} catch (IOException e) {

e.printStackTrace();

} catch (SAXException e) {

e.printStackTrace();

} catch (TikaException e) {

e.printStackTrace();

} finally {

try {

stream.close();

} catch (IOException e) {

e.printStackTrace();

}

}

in the logger statement all i see is org.xml.sax.helpers.DefaultHandler@5bd8e367

i am missing something, unable to figure it out, looking for some help




-Original Message-

From: yeshwanth kumar 
[mailto:yeshwant...@gmail.com<mailto:yeshwant...@gmail.com>]

Sent: Monday, June 30, 2014 1:28 PM

To: d...@tika.apache.org<mailto:d...@tika.apache.org>

Subject: Stack Overflow Question



Unable tp read zipfile using Apache Tika

http://stackoverflow.com/q/24495504/1899893?sem=2



RE: Stack Overflow Question

2014-06-30 Thread Allison, Timothy B.
Or use the ToXMLHandler and parse the XML?

From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Monday, June 30, 2014 3:55 PM
To: yeshwanth kumar
Cc: user@tika.apache.org
Subject: RE: Stack Overflow Question

Might want to look into RecursiveMetadata Parser
http://wiki.apache.org/tika/RecursiveMetadata

Or

https://issues.apache.org/jira/i#browse/TIKA-1329?issueKey=TIKA-1329&serverRenderedViewIssue=true
From: yeshwanth kumar [mailto:yeshwant...@gmail.com]
Sent: Monday, June 30, 2014 3:24 PM
To: Allison, Timothy B.
Subject: Re: Stack Overflow Question

hi tim,

thanks for quick reply,

i changed the contenthandler to bodyContentHandler i got exception for maximum 
word limit,
i used -1 in the bodycontenthandler constructor,

now its another problem, filenames and content are present in string returned 
from handler.tostring()

how can i map a fileName to its content.

thanks,
yeshwanth

On Tue, Jul 1, 2014 at 12:35 AM, Allison, Timothy B. 
mailto:talli...@mitre.org>> wrote:

DefaultHandler is effectively a NullHandler; it doesn't store or do anything.



Try BodyContentHandler or ToXMLHandler or maybe WriteoutHandler.





If you want to write out each embedded file as a binary, try subclassing 
EmbeddedResourceHandler.



QUOTE:


i am using Apache Tika 1.5 for parsing the contents present in a zip file,

here's my sample code

Parser parser = new AutoDetectParser();

ParseContext context = new ParseContext();

context.set(Parser.class, parser);

ContentHandler handler = new DefaultHandler();

Metadata metadata = new Metadata();

InputStream stream = null;

try {

stream = TikaInputStream.get(new File(zipFilePath));

} catch (FileNotFoundException e) {

e.printStackTrace();

}

try {



parser.parse(stream, handler, metadata, context);



logger.info("Content:\t" + handler.toString());

} catch (IOException e) {

e.printStackTrace();

} catch (SAXException e) {

e.printStackTrace();

} catch (TikaException e) {

e.printStackTrace();

} finally {

try {

stream.close();

} catch (IOException e) {

e.printStackTrace();

}

}

in the logger statement all i see is org.xml.sax.helpers.DefaultHandler@5bd8e367

i am missing something, unable to figure it out, looking for some help




-Original Message-

From: yeshwanth kumar 
[mailto:yeshwant...@gmail.com<mailto:yeshwant...@gmail.com>]

Sent: Monday, June 30, 2014 1:28 PM

To: d...@tika.apache.org<mailto:d...@tika.apache.org>

Subject: Stack Overflow Question



Unable tp read zipfile using Apache Tika

http://stackoverflow.com/q/24495504/1899893?sem=2



RE: Stack Overflow Question

2014-07-01 Thread Allison, Timothy B.
Did you try the ToXMLHandler?

From: yeshwanth kumar [mailto:yeshwant...@gmail.com]
Sent: Monday, June 30, 2014 4:50 PM
To: Allison, Timothy B.
Subject: Re: Stack Overflow Question

hi tim,

i tried in all possible ways,
instead of reading entire zip file i parsed individual zipentries,
but even then i faced exceptions such as


org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@37ba3e33
Caused by: java.io.IOException: Invalid header signature; read 0x725020706968736E, expected 0xE11AB1A1E011CFD0 - Your file appears not to be a valid OLE2 document

org.apache.tika.exception.TikaException: Unable to unpack document stream

org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@6f0ee75a

org.apache.tika.exception.TikaException: Error creating OOXML extractor


any suggestions regarding these issues,

thanks,
yeshwanth


On Tue, Jul 1, 2014 at 2:00 AM, yeshwanth kumar 
mailto:yeshwant...@gmail.com>> wrote:

hi tim,

thanks, for sharing the resources but i am unable to figure out how to 
implement it in my code,
what i didn't understand is the flow and recursive steps, when i ran the 
RecursiveMetadataParser
it still giving the same kind of output as filenames combined with content of 
the files,

i am totally confused.

On Tue, Jul 1, 2014 at 1:29 AM, Allison, Timothy B. 
mailto:talli...@mitre.org>> wrote:
Or use the ToXMLHandler and parse the XML?

From: Allison, Timothy B. [mailto:talli...@mitre.org<mailto:talli...@mitre.org>]
Sent: Monday, June 30, 2014 3:55 PM
To: yeshwanth kumar
Cc: user@tika.apache.org<mailto:user@tika.apache.org>
Subject: RE: Stack Overflow Question

Might want to look into RecursiveMetadata Parser
http://wiki.apache.org/tika/RecursiveMetadata

Or

https://issues.apache.org/jira/i#browse/TIKA-1329?issueKey=TIKA-1329&serverRenderedViewIssue=true
From: yeshwanth kumar [mailto:yeshwant...@gmail.com]
Sent: Monday, June 30, 2014 3:24 PM
To: Allison, Timothy B.
Subject: Re: Stack Overflow Question

hi tim,

thanks for quick reply,

i changed the contenthandler to bodyContentHandler i got exception for maximum 
word limit,
i used -1 in the bodycontenthandler constructor,

now its another problem, filenames and content are present in string returned 
from handler.tostring()

how can i map a fileName to its content.

thanks,
yeshwanth

On Tue, Jul 1, 2014 at 12:35 AM, Allison, Timothy B. 
mailto:talli...@mitre.org>> wrote:

DefaultHandler is effectively a NullHandler; it doesn't store or do anything.



Try BodyContentHandler or ToXMLHandler or maybe WriteoutHandler.





If you want to write out each embedded file as a binary, try subclassing 
EmbeddedResourceHandler.



QUOTE:


i am using Apache Tika 1.5 for parsing the contents present in a zip file,

here's my sample code

Parser parser = new AutoDetectParser();

ParseContext context = new ParseContext();

context.set(Parser.class, parser);

ContentHandler handler = new DefaultHandler();

Metadata metadata = new Metadata();

InputStream stream = null;

try {

stream = TikaInputStream.get(new File(zipFilePath));

} catch (FileNotFoundException e) {

e.printStackTrace();

}

try {



parser.parse(stream, handler, metadata, context);



logger.info("Content:\t" + handler.toString());

} catch (IOException e) {

e.printStackTrace();

} catch (SAXException e) {

e.printStackTrace();

} catch (TikaException e) {

e.printStackTrace();

} finally {

try {

stream.close();

} catch (IOException e) {

e.printStackTrace();

}

}

in the logger statement all i see is org.xml.sax.helpers.DefaultHandler@5bd8e367

i am missing something, unable to figure it out, looking for some help




-Original Message-

From: yeshwanth kumar 
[mailto:yeshwant...@gmail.com<mailto:yeshwant...@gmail.com>]

Sent: Monday, June 30, 2014 1:28 PM

To: d...@tika.apache.org<mailto:d...@tika.apache.org>

Subject: Stack Overflow Question



Unable tp read zipfile using Apache Tika

http://stackoverflow.com/q/24495504/1899893?sem=2






RE: Stack Overflow Question

2014-07-01 Thread Allison, Timothy B.
Hmmm….

When I use the ToXMLHandler on the test doc submitted with TIKA-1329, I see 
this:


embed4.zip

embed4.txt
embed_4





That’s a text file inside of a zip file that is itself embedded.  I could see 
doing some parsing on the XML to scrape out each embedded entry’s 
contents and grab the file name from the <h1> element.

If I committed TIKA-1329, would that be of any use to you?   That returns a 
list of metadata objects.  There is one metadata object per embedded file.  The 
text content of each file can be retrieved from each metadata object by this 
key: “tika:content.”
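
For reference, a rough sketch of how that wrapper is used once it is available (assuming the RecursiveParserWrapper/BasicContentHandlerFactory API that eventually shipped; the input path is illustrative):

import java.io.File;
import java.io.InputStream;
import java.util.List;

import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.RecursiveParserWrapper;
import org.apache.tika.sax.BasicContentHandlerFactory;
import org.xml.sax.helpers.DefaultHandler;

public class RecursiveParseExample {
    public static void main(String[] args) throws Exception {
        // wrap the usual AutoDetectParser; ask for plain-text content with no write limit
        RecursiveParserWrapper wrapper = new RecursiveParserWrapper(
                new AutoDetectParser(),
                new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1));

        InputStream stream = TikaInputStream.get(new File(args[0]));
        try {
            // the handler passed here is ignored; per-file handlers come from the factory
            wrapper.parse(stream, new DefaultHandler(), new Metadata(), new ParseContext());
        } finally {
            stream.close();
        }

        // one Metadata object per file: the container first, then each embedded document;
        // the extracted text sits in each Metadata object under the content key described above
        List<Metadata> metadataList = wrapper.getMetadata();
        for (Metadata m : metadataList) {
            System.out.println("---- " + m.get(Metadata.RESOURCE_NAME_KEY));
            for (String name : m.names()) {
                System.out.println(name + " = " + m.get(name));
            }
        }
    }
}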

Best,

Tim
From: yeshwanth kumar [mailto:yeshwant...@gmail.com]
Sent: Tuesday, July 01, 2014 9:00 AM
To: Allison, Timothy B.
Subject: Re: Stack Overflow Question

output is same even with ToXMLHandler

On Tue, Jul 1, 2014 at 5:59 PM, Allison, Timothy B. 
mailto:talli...@mitre.org>> wrote:
Did you try the ToXMLHandler?

From: yeshwanth kumar 
[mailto:yeshwant...@gmail.com<mailto:yeshwant...@gmail.com>]
Sent: Monday, June 30, 2014 4:50 PM

To: Allison, Timothy B.
Subject: Re: Stack Overflow Question

hi tim,

i tried in all possible ways,
instead of reading entire zip file i parsed individual zipentries,
but even then i faced exceptions such as


org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
org.apache.tika.parser.microsoft.OfficeParser@37ba3e33<mailto:org.apache.tika.parser.microsoft.OfficeParser@37ba3e33>
Caused by: java.io.IOException: Invalid header signature; read 
0x725020706968736E, expected 0xE11AB1A1E011CFD0 - Your file appears not to be a 
valid OLE2 document

org.apache.tika.exception.TikaException: Unable to unpack document stream

org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
org.apache.tika.parser.microsoft.OfficeParser@6f0ee75a<mailto:org.apache.tika.parser.microsoft.OfficeParser@6f0ee75a>

org.apache.tika.exception.TikaException: Error creating OOXML extractor


any suggestions regarding these issues,

thanks,
yeshwanth



RE: Stack Overflow Question

2014-07-01 Thread Allison, Timothy B.
Good to hear.  Let us know if you have any other questions or when you run into 
surprises.

From: yeshwanth kumar [mailto:yeshwant...@gmail.com]
Sent: Tuesday, July 01, 2014 10:23 AM
To: Allison, Timothy B.
Subject: Re: Stack Overflow Question

hi tim,

i forgot to change the BodyContentHandler to ToXMLContentHandler in the 
RecursiveMetadataParser; i changed it only in my calling method.

now i am getting the entire document in the structure you specified.

thanks a ton.

-yeshwanth


RE: How to index the parsed content effectively

2014-07-02 Thread Allison, Timothy B.
Hi Sergey,

  I'd take a look at what the DataImportHandler in Solr does.  If you want to 
store the field, you need to create the field with a String (as opposed to a 
Reader); which means you have to have the whole thing in memory.  Also, if 
you're proposing adding a field entry in a multivalued field for a given SAX 
event, I don't think that will help, because you still have to hold the entire 
document in memory before calling addDocument() if you are storing the field.  
If you aren't storing the field, then you could try a Reader.
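
A minimal illustration of the String-vs-Reader distinction on the Lucene side (Lucene 4.x-era field classes; the field names are arbitrary):

import java.io.Reader;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

public class TikaToLucene {

    // Stored + indexed: the whole extracted text has to be in memory as a String.
    static Document storedDoc(String fileName, String extractedText) {
        Document doc = new Document();
        doc.add(new StringField("fileName", fileName, Field.Store.YES));
        doc.add(new TextField("content", extractedText, Field.Store.YES));
        return doc;
    }

    // Indexed only: a Reader streams the tokens in, but nothing is stored for retrieval/highlighting.
    static Document unstoredDoc(String fileName, Reader extractedText) {
        Document doc = new Document();
        doc.add(new StringField("fileName", fileName, Field.Store.YES));
        doc.add(new TextField("content", extractedText));
        return doc;
    }
}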
 
  Some thoughts:

  At the least, you could create a separate Lucene document for each container 
document and each of its embedded documents.
  
  You could also break large documents into logical sections and index those as 
separate documents; but that gets very use-case dependent.

In practice, for many, many use cases I've come across, you can index quite 
large documents with no problems, e.g. "Moby Dick" or "Dream of the Red 
Chamber."  There may be a hit at highlighting time for large docs depending on 
which highlighter you use.  In the old days, there used to be a 10k default 
limit on the number of tokens, but that is now long gone.
  
  For truly large docs (probably machine generated), yes, you could run into 
problems if you need to hold the whole thing in memory.  
  
 Cheers,

  Tim
-Original Message-
From: Sergey Beryozkin [mailto:sberyoz...@gmail.com] 
Sent: Wednesday, July 02, 2014 8:27 AM
To: user@tika.apache.org
Subject: How to index the parsed content effectively

Hi All,

We've been experimenting with indexing the parsed content in Lucene and
our initial attempt was to index the output from
ToTextContentHandler.toString() as a Lucene Text field.

This is unlikely to be effective for large files. So I wonder what
strategies exist for a more effective indexing/tokenization of the
possibly large content.

Perhaps a custom ContentHandler can index content fragments in a unique
Lucene field every time its characters(...) method is called, something
I've been planning to experiment with.

The feedback will be appreciated
Cheers, Sergey


RE: How to index the parsed content effectively

2014-07-14 Thread Allison, Timothy B.
Hi Sergey,

>> Now, we already have the original PDF occupying some space, so 
>>duplicating it (its content) with a Document with Store.YES fields may 
>>not be the best idea in some cases.

In some cases, agreed, but in general, this is probably a good default idea.  
As you point out, you aren't quite duplicating the document -- one copy contains 
the original bytes, and the other contains the text (and metadata?) that was 
extracted from the document.  One reason to store the content in the field is 
for easy highlighting.  You could configure the highlighter to pull the text 
content of the document from a db or other source, but that adds complexity and 
perhaps lookup time.  What you really would not want to do from a time 
perspective is ask Tika to parse the raw bytes to pull the content for 
highlighting at search time.  In general, Lucene's storage of the content is 
very reasonable; on one big batch of text files I have, the Lucene index with 
stored fields is the same size as the uncompressed text files.

>>So I wonder, is it possible somehow for a given Tika Parser, lets say a 
>>PDF parser, report, via the Metadata, the start and end indexes of the 
>>content ? So the consumer will create say InputStreamReader for a 
>>content region and will use Store.NO and this Reader ?

I don't think I quite understand what you're proposing.  The start and end 
indexes of the extracted content?  Wouldn't that just be 0 and the length of 
the string in most cases (beyond-bmp issues aside)?  Or, are you suggesting 
that there may be start and end indexes for content within the actual raw bytes 
of the PDF?  If the latter, for PDFs at least that would effectively require a 
full reparse ... if it were possible, and it probably wouldn't save much in 
time.  For other formats, where that might work, it would create far more 
complexity than value...IMHO.

In general, I'd say store the field.  Perhaps let the user choose to not store 
the field. 

Always interested to hear input from others.

Best,

  Tim


-Original Message-
From: Sergey Beryozkin [mailto:sberyoz...@gmail.com] 
Sent: Friday, July 11, 2014 1:38 PM
To: user@tika.apache.org
Subject: Re: How to index the parsed content effectively

Hi Tim, All.
On 02/07/14 14:32, Allison, Timothy B. wrote:
> Hi Sergey,
>
>I'd take a look at what the DataImportHandler in Solr does.  If you want 
> to store the field, you need to create the field with a String (as opposed to 
> a Reader); which means you have to have the whole thing in memory.  Also, if 
> you're proposing adding a field entry in a multivalued field for a given SAX 
> event, I don't think that will help, because you still have to hold the 
> entire document in memory before calling addDocument() if you are storing the 
> field.  If you aren't storing the field, then you could try a Reader.

I'd like to ask something about using Tika parser and a Reader (and 
Lucene Store.NO)

Consider a case where we have a service which accepts a very large PDF 
file. This file will be stored on the disk or may be in some DB. And 
this service will also use Tika to extract content and populate a Lucene 
Document.
Now, we already have the original PDF occupying some space, so 
duplicating it (its content) with a Document with Store.YES fields may 
not be the best idea in some cases.

So I wonder, is it possible somehow for a given Tika Parser, lets say a 
PDF parser, report, via the Metadata, the start and end indexes of the 
content ? So the consumer will create say InputStreamReader for a 
content region and will use Store.NO and this Reader ?

Does it really make sense at all ? I can create a minor enhancement 
request for parsers getting the access to a low level info like the 
start/stop delimiters of the content to report it ?

Cheers, Sergey




>
>Some thoughts:
>
>At the least, you could create a separate Lucene document for each 
> container document and each of its embedded documents.
>
>You could also break large documents into logical sections and index those 
> as separate documents; but that gets very use-case dependent.
>
>  In practice, for many, many use cases I've come across, you can index 
> quite large documents with no problems, e.g. "Moby Dick" or "Dream of the Red 
> Chamber."  There may be a hit at highlighting time for large docs depending 
> on which highlighter you use.  In the old days, there used to be a 10k 
> default limit on the number of tokens, but that is now long gone.
>
>For truly large docs (probably machine generated), yes, you could run into 
> problems if you need to hold the whole thing in memory.
>
>   Cheers,
>
>Tim
> -Original Message-
> From: Sergey Beryozkin [mailto:sberyoz...@gmai

RE: Avoiding Out of Memory Errors

2014-07-18 Thread Allison, Timothy B.
I'm working on adding a daemon to Tika Server so that it will restart when it 
hits an OOM or other big problem (infinite hangs).  That won't be available 
until Tika 1.7.  

To amplify Nick's recommendations:

ForkParser or Server are your best options for now.
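
A minimal ForkParser sketch, assuming the org.apache.tika.fork API (the child heap size, pool size and input path are illustrative):

import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.fork.ForkParser;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class ForkParserExample {
    public static void main(String[] args) throws Exception {
        // parsing happens in forked JVMs, so an OOM or crash there doesn't kill this process
        ForkParser parser = new ForkParser(
                ForkParserExample.class.getClassLoader(), new AutoDetectParser());
        parser.setJavaCommand("java -Xmx512m");   // heap for the child processes
        parser.setPoolSize(4);                    // number of forked JVMs to keep around
        InputStream stream = new FileInputStream(args[0]);
        try {
            BodyContentHandler handler = new BodyContentHandler(-1);
            parser.parse(stream, handler, new Metadata(), new ParseContext());
            System.out.println(handler.toString());
        } finally {
            stream.close();
            parser.close();   // shuts down the forked JVMs
        }
    }
}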

Are there specific files/file types that are causing the OOM?  Given the size 
of files, is the OOM surprising?  

On TIKA-1294, we found that a specific 4MB PDF would cause an OOM with -Xmx1g.  
 That was surprising and was very quickly addressed by the PDFBox developers.  
If you have specific files that are surprising, please file an issue.

Thank you!



From: Nick Burch [apa...@gagravarr.org]
Sent: Friday, July 18, 2014 4:32 AM
To: user@tika.apache.org
Subject: Re: Avoiding Out of Memory Errors

On Thu, 17 Jul 2014, Shannon Brown wrote:
> Problem:
> How to avoid Out of Memory errors during Tika parsing.

Typical approaches are either to use the ForkParser, or the Tika Server.
Both ensure that if there's a fatal problem with parsing (eg OOM) then
the JVM with your main application in it doesn't die too

For cases where it does die, log it, and if possible report a bug with the
file in question, so we can hopefully fix it for the next release!

Nick

RE: [VOTE] Apache Tika 1.6 release candidate #1

2014-07-28 Thread Allison, Timothy B.
+1

Linux version 2.6.32-431.5.1.el6.x86_64: Java 1.6 and 1.7
Windows 7, Java 1.7

I also ran Tika 1.5 and 1.6 rc1 against a random selection of 10,000 docs (all 
formats) plus all available msoffice-x files in govdocs1, yielding 10,413 docs. 
 There were several improvements in text extraction for PDFs (mostly spacing) 
and 4 fewer exceptions (2 ppt, 1 doc and 1 pdf).

There was one regression:
http://digitalcorpora.org/corp/nps/files/govdocs1/268/268620.pptx 

Stacktrace:
Caused by: java.lang.StringIndexOutOfBoundsException: String index out of 
range: -369073454
at java.lang.String.checkBounds(String.java:371)
at java.lang.String.<init>(String.java:415)
at 
org.apache.poi.util.StringUtil.getFromCompressedUnicode(StringUtil.java:114)
at 
org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:163)
at 
org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:91)
at 
org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:63)
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedOLE(AbstractOOXMLExtractor.java:250)
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:115)
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:243)


-Original Message-
From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov] 
Sent: Monday, July 28, 2014 12:22 AM
To: d...@tika.apache.org
Cc: user@tika.apache.org
Subject: [VOTE] Apache Tika 1.6 release candidate #1

Hi Folks,

A candidate for the Tika 1.6 release is available at:

http://people.apache.org/~mattmann/apache-tika-1.6/rc1/


The release candidate is a zip archive of the sources in:

http://svn.apache.org/repos/asf/tika/tags/1.6/

The SHA1 checksum of the archive is
076ad343be56a540a4c8e395746fa4fda5b5b6d3.

A Maven staging repository is available at:

https://repository.apache.org/content/repositories/orgapachetika-1003/


Please vote on releasing this package as Apache Tika 1.6.
The vote is open for the next 72 hours and passes if a majority of at
least three +1 Tika PMC votes are cast.

[ ] +1 Release this package as Apache Tika 1.6
[ ] -1 Do not release this package because…

Thank you!

Cheers,
Chris

P.S. Here is my +1!







RE: Tika - Outlook msg file with another Outlook msg as an attachment - OutlookExtractor passes empty stream

2014-07-31 Thread Allison, Timothy B.
AarKay,

  We have a unit test for an MSG embedded within an MSG in 
POIContainerExtractionTest.  I also just tried a newly created msg within an 
msg file, and I can extract the embedded content with 
TikaTest.RecursiveMetaParser.  This suggests that the issue is not within the 
OutlookParser.

  If you want the bytes of the embedded file, have you tried (or are you using) 
the Unpacker Resource?  IIRC, this gets the attachments (non-recursively!!!) 
out of each doc you send it and sends you back a zip (or tar).  You should be 
able to step through the ZipEntr(ies) and get the original attachment bytes.
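
For example, assuming you have already saved the zip returned by /unpack to disk, stepping through the entries is plain java.util.zip work:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class ReadUnpackedAttachments {
    public static void main(String[] args) throws Exception {
        // args[0]: the zip that tika-server's /unpack resource returned, saved to disk
        ZipInputStream zip = new ZipInputStream(new FileInputStream(args[0]));
        try {
            ZipEntry entry;
            byte[] buffer = new byte[8192];
            while ((entry = zip.getNextEntry()) != null) {
                // each entry is one attachment, in its original bytes
                FileOutputStream out = new FileOutputStream(entry.getName());
                try {
                    int n;
                    while ((n = zip.read(buffer)) != -1) {
                        out.write(buffer, 0, n);
                    }
                } finally {
                    out.close();
                }
            }
        } finally {
            zip.close();
        }
    }
}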

   Best,

 Tim
  

-Original Message-
From: AarKay [mailto:ksu.wildc...@gmail.com] 
Sent: Thursday, July 31, 2014 12:30 AM
To: user@tika.apache.org
Subject: Tika - Outlook msg file with another Outlook msg as an attachment - 
OutlookExtractor passes empty stream

I am using Tika Server (TikaJaxRs) for text extraction needs.
I also have a need to extract the attachments in the file and save it to the 
disk in its native format.
I was able to do it by having CustomParser and write the file to disk using 
'stream' in parse method.

Here is the post I used as a reference for building CustomParser.
http://stackoverflow.com/questions/20172465/get-embedded-resourses-in-doc-
files-using-apache-tika

I was able to get it work fine if the attachment is anything but Outlook msg 
file.

I am running into an issue when the attachment is a Outlook msg file.
When CustomParser.parse method gets invoked the stream passed to it is empty 
because of which the file thats being written to disk is always 0 KB.

Digging through the code I noticed that in OutlookExtractor.java class the 
attachment is handled by OfficeParser because msg.attachdata is always null 
when attachment is a Outlook msg and thats where it is always sending empty 
stream to CustomParser.

Here is the snippet of code from OutlookExtractor where it iterates through 
Attachment files and uses handleEmbeddedResource method only when 
msg.attachData is not null.
But msg.attachData is always null if the Attachment is of type Outlook msg 
because of which stream is always empty when delegating the request to 
CustomParser.parse method.

Can someone please tell me how can i access the msg attachment and save it 
to disk in its Native format?

for (AttachmentChunks attachment : msg.getAttachmentFiles()) {
    xhtml.startElement("div", "class", "attachment-entry");

    String filename = null;
    if (attachment.attachLongFileName != null) {
        filename = attachment.attachLongFileName.getValue();
    } else if (attachment.attachFileName != null) {
        filename = attachment.attachFileName.getValue();
    }
    if (filename != null && filename.length() > 0) {
        xhtml.element("h1", filename);
    }
    if (attachment.attachData != null) {
        handleEmbeddedResource(
            TikaInputStream.get(attachment.attachData.getValue()),
            filename,
            null, xhtml, true
        );
    }
    if (attachment.attachmentDirectory != null) {
        handleEmbededOfficeDoc(
            attachment.attachmentDirectory.getDirectory(),
            xhtml
        );
    }
    xhtml.endElement("div");
}


Thanks
-AarKay



RE: TIKA - how to read chunks at a time from a very large file?

2014-08-28 Thread Allison, Timothy B.
Probably better question for the user list.

Extending a ContentHandler and using that in ContentHandlerDecorator is pretty 
straightforward.

Would it be easy enough to write to file by passing in an OutputStream to 
WriteOutContentHandler?
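
A sketch of that idea, using the Writer-backed BodyContentHandler constructor so the extracted text is streamed straight to a file instead of being buffered in one String (the paths are illustrative):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

public class StreamToFile {
    public static void main(String[] args) throws Exception {
        Parser parser = new AutoDetectParser();
        InputStream in = new FileInputStream(args[0]);
        Writer out = new OutputStreamWriter(new FileOutputStream(args[1]), "UTF-8");
        try {
            // characters are written to the Writer as the SAX events arrive,
            // so the 5GB of text never has to fit in memory at once
            BodyContentHandler handler = new BodyContentHandler(out);
            parser.parse(in, handler, new Metadata(), new ParseContext());
        } finally {
            out.close();
            in.close();
        }
    }
}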

-Original Message-
From: ruby [mailto:rshoss...@gmail.com] 
Sent: Thursday, August 28, 2014 2:07 PM
To: tika-...@lucene.apache.org
Subject: TIKA - how to read chunks at a time from a very large file?

Using ContentHandler is there a way to read chunks at a time from a very
large file (over 5GB). Right now I'm doing following to read the entire
content at once:

InputStream stream = new FileInputStream(file);
Parser p = new AutoDetectParser();
Metadata meta = new Metadata();
WriteOutContentHandler handler = new WriteOutContentHandler(-1);
ParseContext context = new ParseContext();
p.parse(stream, handler, meta, context);
String content = handler.toString();

Since the files contain over 5GB data, the content string here will end up
too much data in memory. I want to avoid this and want to read chunk at a
time.

I tried ParsingReader and I can read chunks using this but we are splitting
on words. Some of the files have Chinese/Japanese words, so we can't process
using white-spaces either. 







FW: How to exclude a mimetype in tika?

2014-09-18 Thread Allison, Timothy B.
Tika Colleagues (Tika'ers, Tikis?),

Is this the right answer:

Drop the relevant parsers from the tika.config file and make sure to point solr 
to this file in your solr request handler definition: /my/path/to/tika.config?

  I only have experience as a programmatic user of Tika and would use a 
DocumentSelector, but would the above work?
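
For the programmatic route, a minimal DocumentSelector sketch that skips embedded video attachments (the mime-type check is only an example; as far as I know the default embedded-document handling consults the selector it finds in the ParseContext):

import org.apache.tika.extractor.DocumentSelector;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;

public class SkipVideos {
    public static ParseContext buildContext(Parser parser) {
        ParseContext context = new ParseContext();
        context.set(Parser.class, parser);
        // called for each embedded document; return false to skip it
        context.set(DocumentSelector.class, new DocumentSelector() {
            public boolean select(Metadata metadata) {
                String type = metadata.get(Metadata.CONTENT_TYPE);
                return type == null || !type.startsWith("video/");
            }
        });
        return context;
    }
}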

-Original Message-
From: keeblerh [mailto:keebl...@yahoo.com] 
Sent: Thursday, September 18, 2014 10:15 AM
To: solr-u...@lucene.apache.org
Subject: Re: How to exclude a mimetype in tika?

eShard wrote
> Good afternoon,
> I'm using solr 4.0 Final
> There are movies "hidden" in zip files that need to be excluded from the
> index.
> I can't filter movies on the crawler because then I would have to exclude
> all zip files.
> I was told I can have tika skip the movies.
> the details are escaping me at this point.
> How do I exclude a file in the tika configuration?
> I assume it's something I add in the update/extract handler but I'm not
> sure.
> 
> Thanks,

I am having the same issue.  I need to exclude some mime types from the zip
files and am using SOLR 4.8.  Did you ever get an answer to this?  Thanks.





RE: Apache Tika - JSON?

2014-09-26 Thread Allison, Timothy B.
The current json output option in the app and server only dumps metadata…as you 
probably know.

I plan to add a json version of the RecursiveParserWrapper (list of Metadata 
objects with one entry for content) to the app shortly.  Would that be of any 
use?

Are you using the app, the server, or calling Tika programmatically?


From: Vineet Ghatge Hemantkumar [mailto:heman...@usc.edu]
Sent: Thursday, September 25, 2014 11:06 PM
To: user@tika.apache.org
Subject: Apache Tika - JSON?

Hello all,

I was wondering if there is any built-in parser to help with converting from 
XHTML to JSON.

My research showed that there is one named org.apache.io.json which has just one 
method implemented. Also, I tried the GJSON library to do this, but it does not 
seem to work with Tika. Any suggestions will be appreciated.

Regards,
Vineet


RE: Apache Tika - JSON?

2014-09-26 Thread Allison, Timothy B.
I suspect, though, that what you want is not what I answered (sorry!)…namely 
entities mapped from xhtml to json.  For that, I don’t think we have anything 
available in Tika, but it wouldn’t be difficult (famous last words) to write a 
content handler to do that…

We have integrated the GSON library to serialize/deserialize Metadata objects 
in tika-serialization.
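
If it helps, a short sketch against that module (this assumes the JsonMetadata helper in tika-serialization; the exact method names are from memory, so worth double-checking):

import java.io.StringWriter;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.serialization.JsonMetadata;

public class MetadataToJson {
    public static String toJson(Metadata metadata) throws Exception {
        StringWriter writer = new StringWriter();
        // serialize a single Metadata object (including multi-valued keys) to JSON
        JsonMetadata.toJson(metadata, writer);
        return writer.toString();
    }
}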

From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Friday, September 26, 2014 6:54 AM
To: user@tika.apache.org
Subject: RE: Apache Tika - JSON?

The current json output option in the app and server only dump metadata…as you 
probably know.

I plan to add a json version of the RecursiveParserWrapper (list of Metadata 
objects with one entry for content) to the app shortly.  Would that be of any 
use?

Are you using the app, the server, or calling Tika programmatically?


From: Vineet Ghatge Hemantkumar [mailto:heman...@usc.edu]
Sent: Thursday, September 25, 2014 11:06 PM
To: user@tika.apache.org<mailto:user@tika.apache.org>
Subject: Apache Tika - JSON?

Hello all,

I was wondering if there any in built parser to get help in conversion from 
XHTML to JSON.

My research showed that there is one named org.apache.io.json which just one 
method implemented. Also, I tried GJSON library to do this, but it does not 
seem to work with Tika. Any suggestions will be appreciated?

Regards,
Vineet


RE: Problem with content extraction

2014-10-07 Thread Allison, Timothy B.
I’ve seen this before on a few documents.  You might experiment with setting 
PDFParserConfig’s suppressDuplicateOverlappingText to true.  If that doesn’t 
work, I’d recommend running the pure PDFBox app’s ExtractText on the document.  
If you get the same doubling of letters, ask over on 
u...@pdfbox.apache.org.  If you don’t, let us 
know!
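
A minimal sketch of that setting, passed in through the ParseContext (the rest is the usual parse call):

import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

public class DedupePdfText {
    public static String extract(InputStream stream) throws Exception {
        PDFParserConfig config = new PDFParserConfig();
        // drop characters re-drawn at (nearly) the same position, e.g. fake bold/shadowed text
        config.setSuppressDuplicateOverlappingText(true);

        ParseContext context = new ParseContext();
        context.set(PDFParserConfig.class, config);

        Parser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(-1);
        parser.parse(stream, handler, new Metadata(), context);
        return handler.toString();
    }
}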

Best,

   Tim


From: Mohammad Ghufran [mailto:emghuf...@gmail.com]
Sent: Tuesday, October 07, 2014 8:37 AM
To: user@tika.apache.org
Subject: Problem with content extraction

Hello,

I am using Tika to extract the content of documents, but I've run into a 
problem. In some documents, the characters in the output are repeated several 
times. For example, while processing a PDF file, the text "FORMATION" is 
transformed into "FFOORRMMAATTIIOONN" and so on.

I tried looking through the mailing lists but didn't find any reference to 
this. I also tried with the latest version of tika but it results in the same 
output.

The only thing i can notice is that the document seems to have text written 
with some shadow - if it is useful.

I would like to know if someone has encountered this  problem before and what 
are the possible solutions, if any.

Best Regards,
Ghufran


RE: Customizing Metadata Keys

2014-10-09 Thread Allison, Timothy B.
I agree with Nick’s recommendation on post-parsing key mapping, and I’d like to 
put in a plug for the RecursiveParserWrapper, which may be of use for you.  
I’ve been intending to add that to the app commandline and to server…how are 
you handling embedded document metadata?  Would the wrapper be of any use or do 
you not have any embedded docs in your doc set?
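
For the post-parse key mapping itself, a minimal sketch (the key names in the map are only examples):

import java.util.HashMap;
import java.util.Map;

import org.apache.tika.metadata.Metadata;

public class MetadataKeyMapper {
    // map from Tika's keys to whatever your index/schema expects; extend as needed
    private static final Map<String, String> KEY_MAP = new HashMap<String, String>();
    static {
        KEY_MAP.put("dc:creator", "author");
        KEY_MAP.put("dcterms:created", "created_at");
    }

    public static Metadata remap(Metadata tikaMetadata) {
        Metadata mapped = new Metadata();
        for (String name : tikaMetadata.names()) {
            String target = KEY_MAP.containsKey(name) ? KEY_MAP.get(name) : name;
            for (String value : tikaMetadata.getValues(name)) {
                mapped.add(target, value);   // keep multi-valued fields multi-valued
            }
        }
        return mapped;
    }
}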

I’ve also been meaning to dump counts of metadata keys from the govdocs1 
corpus, would that be of any use, or do you already know the keys that you care 
about?

Cheers,

 Tim
From: Can Duruk [mailto:c...@duruk.net]
Sent: Thursday, October 09, 2014 12:13 PM
To: user@tika.apache.org
Subject: Re: Customizing Metadata Keys

>I'd suggest you do the mapping from Tika keys to your keys in the server.
>All the parsers should return consistent keys, so the "output" side is
>the
>best place to map.

That seems to be the now-obvious solution, thanks for the suggestion.

> Perhaps a re-mapping downstream ContentHandler
> that takes in the Metadata object and will reformat
> the 
>
> 
> Chris Mattmann
> chris.mattm...@gmail.com
>
>
>
>
> -Original Message-
> From: Nick Burch mailto:apa...@gagravarr.org>>
> Reply-To: mailto:user@tika.apache.org>>
> Date: Thursday, October 9, 2014 at 12:32 PM
> To: mailto:user@tika.apache.org>>
> Subject: Re: Customizing Metadata Keys
>
> >On Wed, 8 Oct 2014, Can Duruk wrote:
> >> My question is regarding setting the metadata keys coming from the
> >>parsers
> >> to my own keys.
> >>
> >> For my application, I am using Tika to extract the metadata for a bunch
> >>of
> >> files. I am using the embedded HTTP server which I modified for my
> >>needs to
> >> return instead of CSV. (Hoping to submit that as a patch soon)
> >>
> >> However, the keys in the JSON are all in different formats and I need
> >>them
> >> to conform to my own requirements.
> >
> >I'd suggest you do the mapping from Tika keys to your keys in the server.
> >All the parsers should return consistent keys, so the "output" side is
> >the
> >best place to map. Trying to do it in each parser would be much more
> >work.
> >Just put the mapping in between where you call the parser, and where you
> >output
> >
> >Nick
>
>


internal vs external property?

2014-11-20 Thread Allison, Timothy B.
All,
  What is the difference between an internal and an external Property?  I'm not 
(quickly) seeing how Metadata is using that Boolean.  Are there other pieces of 
code that make use of the distinction?
  Thank you.

 Best,

Tim



RE: Encrypted PDF issues & build issues

2014-12-11 Thread Allison, Timothy B.
Y, sorry.  As you point out, that should be fixed in PDFBox 1.8.8.  A vote was 
just taken for that, so that will be out very soon.  Last I looked at 
integrating PDFBox 1.8.8-SNAPSHOT, the upgrade requires us to change one test 
(I think?) in Tika…which is why you’re getting a failed build.  Your error 
message is not what I was getting, but it was in that test.

In short…by early next week (I hope), Tika trunk will be good to go with PDFBox 
1.8.8.

If you’d like the one or two lines of code to change to get a Tika to build 
with 1.8.8-SNAPSHOT, let me know.

Best,

   Tim

From: Peter Bowyer [mailto:pe...@mapledesign.co.uk]
Sent: Thursday, December 11, 2014 12:43 PM
To: user@tika.apache.org
Subject: Encrypted PDF issues & build issues

Hi list,

I'm having issues with encrypted PDFs



PDF Testcases pass, but fail on my own encrypted PDF (sample file at 
https://dl.dropboxusercontent.com/u/2460167/encryption.pdf. Its password is 
'testing123')

To rule out a problem with the PDF I tested with Xpdf, and pdftotext extracts 
the text without issue. Unfortunately I need the metadata too.

$ pdftotext -opw testing123 encrypted.pdf

I'm running on Centos 6.6, and the Java packages installed are:
java-1.6.0-openjdk.x86_64   1:1.6.0.33-1.13.5.1.el6_6
java-1.6.0-openjdk-devel.x86_64 1:1.6.0.33-1.13.5.1.el6_6
java-1.7.0-openjdk.x86_64   1:1.7.0.71-2.5.3.1.el6 @updates
java-1.7.0-openjdk-devel.x86_64 1:1.7.0.71-2.5.3.1.el6 @updates


Some outputs:

$ java -jar tika-app-1.7-SNAPSHOT.jar --password=testing123 ~/sample.pdf
INFO - Document is encrypted
Exception in thread "main" org.apache.tika.exception.TikaException: Unable to 
extract PDF content
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:150)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:161)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:146)
at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:440)
at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:116)
Caused by: java.io.IOException: javax.crypto.IllegalBlockSizeException: Input 
length must be multiple of 16 when decrypting with padded cipher
at 
javax.crypto.CipherInputStream.getMoreData(CipherInputStream.java:115)
at javax.crypto.CipherInputStream.read(CipherInputStream.java:233)
at javax.crypto.CipherInputStream.read(CipherInputStream.java:209)
at 
org.apache.pdfbox.pdmodel.encryption.SecurityHandler.encryptData(SecurityHandler.java:312)
at 
org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptStream(SecurityHandler.java:413)
at 
org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:386)
at 
org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptObject(SecurityHandler.java:361)
at 
org.apache.pdfbox.pdmodel.encryption.SecurityHandler.proceedDecryption(SecurityHandler.java:192)
at 
org.apache.pdfbox.pdmodel.encryption.StandardSecurityHandler.decryptDocument(StandardSecurityHandler.java:158)
at 
org.apache.pdfbox.pdmodel.PDDocument.openProtection(PDDocument.java:1597)
at org.apache.pdfbox.pdmodel.PDDocument.decrypt(PDDocument.java:943)
at 
org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:337)
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:134)
... 7 more
Caused by: javax.crypto.IllegalBlockSizeException: Input length must be 
multiple of 16 when decrypting with padded cipher
at com.sun.crypto.provider.CipherCore.doFinal(CipherCore.java:750)
at com.sun.crypto.provider.CipherCore.doFinal(CipherCore.java:676)
at com.sun.crypto.provider.AESCipher.engineDoFinal(AESCipher.java:420)
at javax.crypto.Cipher.doFinal(Cipher.java:1805)
at 
javax.crypto.CipherInputStream.getMoreData(CipherInputStream.java:112)
... 19 more




I searched the pdfbox issue tracker and found 
https://issues.apache.org/jira/browse/PDFBOX-2469 and 
https://issues.apache.org/jira/browse/PDFBOX-2510, which in turn link to 
related issues. The ticket status says a number of these issues are fixed in 
the 1.8.8 snapshot, and if you run using the Non-Sequential Parser.

So I edited `tika-parsers/pom.xml` and set the PDFBox version to 1.8.8-SNAPSHOT. I also edited 
`tika-parsers/src/main/resources/org/apache/tika/parser/pdf/PDFParser.properties`
 and enabled the non-sequential parser.

Now tika won't build. I change PDFParser.properties back and it won't build 
either.

Running org.apache.tika.parser.pdf.PDFParserTest
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 7 0 
(origin offset 0)
ERROR [main] (NonSe

RE: Encrypted PDF issues & build issues

2014-12-15 Thread Allison, Timothy B.
Upgrade just made in Tika trunk.  The integration required more than changing 
the one test…Sorry about that!

Let us know if there are any surprises with the upgrade.


RE: Outputting JSON from tika-server/meta

2014-12-18 Thread Allison, Timothy B.
Do you have any luck if you call /metadata instead of /meta?

That should trigger MetadataEP which will return Json, no?

I'm not sure why we have both handlers, but we do...


-Original Message-
From: Sergey Beryozkin [mailto:sberyoz...@gmail.com] 
Sent: Thursday, December 18, 2014 9:56 AM
To: user@tika.apache.org
Subject: Re: Outputting JSON from tika-server/meta

Hi Peter
Thanks, you are too nice, it is a minor bug :-)
Cheers, Sergey
On 18/12/14 14:50, Peter Bowyer wrote:
> Thanks Sergey, I have opened TIKA-1497 for this enhancement.
>
> Best wishes,
> Peter
>
> On 18 December 2014 at 14:31, Sergey Beryozkin  > wrote:
>
> Hi,
> I see MetadataResource returning StreamingOutput and it has
> @Produces(text/csv) only. As such this MBW has no effect at the moment.
>
> We can update MetadataResource to return Metadata directly if
> application/json is requested or update MetadataResource to directly
> convert Metadata to JSON in case of JSON being accepted
>
> Can you please open a JIRA issue ?
>
> Cheers, Sergey
>
>
>
> On 18/12/14 13:58, Peter Bowyer wrote:
>
> Hi,
>
> I suspect this has a really simple answer, but it's eluding me.
>
> How do I get the response from
> curl -X PUT -T /path/to/file.pdf http://localhost:9998/meta
> to be JSON and not CSV?
>
> I've discovered JSONMessageBodyWriter.java
> 
> (https://github.com/apache/__tika/blob/__af19f3ea04792cad81b428f1df9f5e__bbb2501913/tika-server/src/__main/java/org/apache/tika/__server/JSONMessageBodyWriter.__java
> 
> )
> so I think the functionality is present, tried adding --header
> "Accept:
> application/json" to the cURL call, in line with the
> documentation for
> outputting CSV, but no luck so far.
>
> Many thanks,
> Peter
>
>
>
>
> --
> Maple Design Ltd
> http://www.mapledesign.co.uk
> +44 (0)845 123 8008
>
> Reg. in England no. 05920531




Tika 2.0???

2014-12-18 Thread Allison, Timothy B.
I feel Tika 2.0 coming up soon (well, April-ish?!) and the breaking of some 
other areas of back compat, esp. parser class loading -> config ... 

What other areas for breaking or revamping do others see for 2.0?

We need a short-term fix to get the tesseract ocr integration+metadata out the 
door with 1.7, of course.


-Original Message-
From: Chris Mattmann [mailto:chris.mattm...@gmail.com] 
Sent: Thursday, December 18, 2014 10:42 AM
To: user@tika.apache.org
Subject: Re: Outputting JSON from tika-server/meta

Yeah I think we should probably combine them..and make
JSON the default (which unfortunately would break back
compat, but in my mind would make a lot more sense)


Chris Mattmann
chris.mattm...@gmail.com




-Original Message-
From: "Allison, Timothy B." 
Reply-To: 
Date: Thursday, December 18, 2014 at 7:20 AM
To: "user@tika.apache.org" 
Subject: RE: Outputting JSON from tika-server/meta

>Do you have any luck if you call /metadata instead of /meta?
>
>That should trigger MetadataEP which will return Json, no?
>
>I'm not sure why we have both handlers, but we do...
>
>
>-Original Message-
>From: Sergey Beryozkin [mailto:sberyoz...@gmail.com]
>Sent: Thursday, December 18, 2014 9:56 AM
>To: user@tika.apache.org
>Subject: Re: Outputting JSON from tika-server/meta
>
>Hi Peter
>Thanks, you are too nice, it is a minor bug :-)
>Cheers, Sergey
>On 18/12/14 14:50, Peter Bowyer wrote:
>> Thanks Sergey, I have opened TIKA-1497 for this enhancement.
>>
>> Best wishes,
>> Peter
>>
>> On 18 December 2014 at 14:31, Sergey Beryozkin > <mailto:sberyoz...@gmail.com>> wrote:
>>
>> Hi,
>> I see MetadataResource returning StreamingOutput and it has
>> @Produces(text/csv) only. As such this MBW has no effect at the
>>moment.
>>
>> We can update MetadataResource to return Metadata directly if
>> application/json is requested or update MetadataResource to directly
>> convert Metadata to JSON in case of JSON being accepted
>>
>> Can you please open a JIRA issue ?
>>
>> Cheers, Sergey
>>
>>
>>
>> On 18/12/14 13:58, Peter Bowyer wrote:
>>
>> Hi,
>>
>> I suspect this has a really simple answer, but it's eluding me.
>>
>> How do I get the response from
>> curl -X PUT -T /path/to/file.pdf http://localhost:9998/meta
>> to be JSON and not CSV?
>>
>> I've discovered JSONMessageBodyWriter.java
>> 
>>(https://github.com/apache/__tika/blob/__af19f3ea04792cad81b428f1df9f5e__
>>bbb2501913/tika-server/src/__main/java/org/apache/tika/__server/JSONMessa
>>geBodyWriter.__java
>> 
>><https://github.com/apache/tika/blob/af19f3ea04792cad81b428f1df9f5ebbb250
>>1913/tika-server/src/main/java/org/apache/tika/server/JSONMessageBodyWrit
>>er.java>)
>> so I think the functionality is present, tried adding --header
>> "Accept:
>> application/json" to the cURL call, in line with the
>> documentation for
>> outputting CSV, but no luck so far.
>>
>> Many thanks,
>> Peter
>>
>>
>>
>>
>> --
>> Maple Design Ltd
>> http://www.mapledesign.co.uk
>> <http://www.mapledesign.co.uk/>+44 (0)845 123 8008
>>
>> Reg. in England no. 05920531
>
>




RE: Outputting JSON from tika-server/meta

2014-12-18 Thread Allison, Timothy B.
Ha, yes, that is on my ever growing list of todos.  That is slightly different, 
though, from metadata so I’d want to add a separate endpoint.

Does the format you get with the –J option on tika-app from 1.7-SNAPSHOT work 
for you?



From: Peter Bowyer [mailto:pe...@mapledesign.co.uk]
Sent: Thursday, December 18, 2014 10:53 AM
To: user@tika.apache.org
Subject: Re: Outputting JSON from tika-server/meta

If the API is being modified, could we add an endpoint which will return a 
combined JSON output, like:
{
   "meta" : { ... },
   "content" : { "string of content" }
}

This would save me making two API calls, fetching each individually and loading 
the document twice. /unpack does something similar, but returns a single file.

Peter

On 18 December 2014 at 15:42, Chris Mattmann 
mailto:chris.mattm...@gmail.com>> wrote:
Yeah I think we should probably combine them..and make
JSON the default (which unfortunately would break back
compat, but in my mind would make a lot more sense)


Chris Mattmann
chris.mattm...@gmail.com<mailto:chris.mattm...@gmail.com>




-Original Message-
From: "Allison, Timothy B." mailto:talli...@mitre.org>>
Reply-To: mailto:user@tika.apache.org>>
Date: Thursday, December 18, 2014 at 7:20 AM
To: "user@tika.apache.org<mailto:user@tika.apache.org>" 
mailto:user@tika.apache.org>>
Subject: RE: Outputting JSON from tika-server/meta

>Do you have any luck if you call /metadata instead of /meta?
>
>That should trigger MetadataEP which will return Json, no?
>
>I'm not sure why we have both handlers, but we do...
>
>
>-Original Message-
>From: Sergey Beryozkin 
>[mailto:sberyoz...@gmail.com<mailto:sberyoz...@gmail.com>]
>Sent: Thursday, December 18, 2014 9:56 AM
>To: user@tika.apache.org<mailto:user@tika.apache.org>
>Subject: Re: Outputting JSON from tika-server/meta
>
>Hi Peter
>Thanks, you are too nice, it is a minor bug :-)
>Cheers, Sergey
>On 18/12/14 14:50, Peter Bowyer wrote:
>> Thanks Sergey, I have opened TIKA-1497 for this enhancement.
>>
>> Best wishes,
>> Peter
>>
>> On 18 December 2014 at 14:31, Sergey Beryozkin 
>> mailto:sberyoz...@gmail.com>
>> <mailto:sberyoz...@gmail.com<mailto:sberyoz...@gmail.com>>> wrote:
>>
>> Hi,
>> I see MetadataResource returning StreamingOutput and it has
>> @Produces(text/csv) only. As such this MBW has no effect at the
>>moment.
>>
>> We can update MetadataResource to return Metadata directly if
>> application/json is requested or update MetadataResource to directly
>> convert Metadata to JSON in case of JSON being accepted
>>
>> Can you please open a JIRA issue ?
>>
>> Cheers, Sergey
>>
>>
>>
>> On 18/12/14 13:58, Peter Bowyer wrote:
>>
>> Hi,
>>
>> I suspect this has a really simple answer, but it's eluding me.
>>
>> How do I get the response from
>> curl -X PUT -T /path/to/file.pdf http://localhost:9998/meta
>> to be JSON and not CSV?
>>
>> I've discovered JSONMessageBodyWriter.java
>>
>>(https://github.com/apache/__tika/blob/__af19f3ea04792cad81b428f1df9f5e__
>>bbb2501913/tika-server/src/__main/java/org/apache/tika/__server/JSONMessa
>>geBodyWriter.__java
>>
>><https://github.com/apache/tika/blob/af19f3ea04792cad81b428f1df9f5ebbb250
>>1913/tika-server/src/main/java/org/apache/tika/server/JSONMessageBodyWrit
>>er.java>)
>> so I think the functionality is present, tried adding --header
>> "Accept:
>> application/json" to the cURL call, in line with the
>> documentation for
>> outputting CSV, but no luck so far.
>>
>> Many thanks,
>> Peter
>>
>>
>>
>>
>> --
>> Maple Design Ltd
>> http://www.mapledesign.co.uk
>> <http://www.mapledesign.co.uk/>+44 (0)845 123 
>> 8008
>>
>> Reg. in England no. 05920531
>
>



--
Maple Design Ltd
http://www.mapledesign.co.uk
<http://www.mapledesign.co.uk/>+44 (0)845 123 8008

Reg. in England no. 05920531


RE: Outputting JSON from tika-server/meta

2014-12-18 Thread Allison, Timothy B.
Doh!  K, looks like we aren’t loading that in TikaServerCLI.

Does anyone know how we’re using MetadataEP?

From: Peter Bowyer [mailto:pe...@mapledesign.co.uk]
Sent: Thursday, December 18, 2014 10:57 AM
To: user@tika.apache.org
Subject: Re: Outputting JSON from tika-server/meta

On 18 December 2014 at 15:20, Allison, Timothy B. 
mailto:talli...@mitre.org>> wrote:
Do you have any luck if you call /metadata instead of /meta?

I have no luck with that:

Dec 18, 2014 3:55:21 PM org.apache.cxf.jaxrs.utils.JAXRSUtils findTargetMethod
WARNING: No operation matching request path "/metadata" is found, Relative 
Path: /metadata, HTTP Method: PUT, ContentType: */*, Accept: */*,. Please 
enable FINE/TRACE log level for more details.
Dec 18, 2014 3:55:21 PM org.apache.cxf.jaxrs.impl.WebApplicationExceptionMapper 
toResponse
WARNING: javax.ws.rs.ClientErrorException: HTTP 404 Not Found
at 
org.apache.cxf.jaxrs.utils.SpecExceptions.toHttpException(SpecExceptions.java:117)
at 
org.apache.cxf.jaxrs.utils.ExceptionUtils.toHttpException(ExceptionUtils.java:157)
at 
org.apache.cxf.jaxrs.utils.JAXRSUtils.findTargetMethod(JAXRSUtils.java:526)
at 
org.apache.cxf.jaxrs.interceptor.JAXRSInInterceptor.processRequest(JAXRSInInterceptor.java:177)
at 
org.apache.cxf.jaxrs.interceptor.JAXRSInInterceptor.handleMessage(JAXRSInInterceptor.java:77)
at 
org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
at 
org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
at 
org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:243)
at 
org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261)
at 
org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:70)
at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088)
at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024)
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:370)
at 
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
at 
org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
at 
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:651)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
at 
org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
at 
org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:696)
at 
org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:53)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Thread.java:745)

Best regards,
Peter


Re: Outputting JSON from tika-server/meta

2014-12-18 Thread Allison, Timothy B.
Peter,
 I'm waiting on feedback on TIKA-1497, but rmeta should get you what you want 
via TIKA-1498.

Let us know if there are any surprises.

   Best,

Tim

-Original Message-
From: Tim Allison (JIRA) [mailto:j...@apache.org] 
Sent: Thursday, December 18, 2014 2:52 PM
To: d...@tika.apache.org
Subject: [jira] [Resolved] (TIKA-1498) Add RecursiveParserWrapper to tika-server


 [ 
https://issues.apache.org/jira/browse/TIKA-1498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-1498.
---
Resolution: Fixed

r1646520.

End point for now is "rmeta"

> Add RecursiveParserWrapper to tika-server
> -
>
> Key: TIKA-1498
> URL: https://issues.apache.org/jira/browse/TIKA-1498
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
>
> In TIKA-1451, we added Jukka/Nick's Recursive Parser Wrapper to tika-app.  
> Let's add that format of output to tika-server.
> What should we call the endpoint: rmeta?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


RE: Outputting JSON from tika-server/meta

2014-12-19 Thread Allison, Timothy B.
All,

With many thanks to Sergey, I added JSON and XMP output to “/meta” and folded 
MetadataEP into MetadataResource so that users can request specific metadata 
value(s). (TIKA-1497, TIKA-1499)

I also added a new endpoint “/rmeta” that is equivalent to tika-app’s –J 
(TIKA-1498) – JSONified view of a list of metadata objects representing the 
container document and all embedded docs…aka Jukka and Nick’s 
RecursiveParserWrapper.

I also updated the jax-rs wiki to reflect these changes.

Please kick the tires and let us know if there are any surprises.

Best,

   Tim
From: Peter Bowyer [mailto:pe...@mapledesign.co.uk]
Sent: Thursday, December 18, 2014 8:58 AM
To: user@tika.apache.org
Subject: Outputting JSON from tika-server/meta

Hi,

I suspect this has a really simple answer, but it's eluding me.

How do I get the response from
curl -X PUT -T /path/to/file.pdf http://localhost:9998/meta
to be JSON and not CSV?

I've discovered JSONMessageBodyWriter.java 
(https://github.com/apache/tika/blob/af19f3ea04792cad81b428f1df9f5ebbb2501913/tika-server/src/main/java/org/apache/tika/server/JSONMessageBodyWriter.java)
 so I think the functionality is present. I tried adding --header "Accept: 
application/json" to the cURL call, in line with the documentation for 
outputting CSV, but no luck so far.

Many thanks,
Peter


RE: Running tika-server as a service

2015-01-08 Thread Allison, Timothy B.
Peter,
  I don’t have any immediate solutions, but there are two options in the 
pipeline (probably Tika 1.8):


1)  Lewis John McGibbney on TIKA-894 is going to add a war/webapp.

2)  I plan to open an issue related to TIKA-1330 that will make our current 
jax-rs tika-server more robust to OOM and permanent hangs, i.e. the server 
process will shut itself down if it encounters either of these, and a watcher 
process will restart the server process… as currently happens in the dev 
version of TIKA-1330.

  This is an interest close to my heart, and I look forward to hearing how 
others are handling this.

  Best,

   Tim

From: Peter Bowyer [mailto:pe...@mapledesign.co.uk]
Sent: Thursday, January 08, 2015 6:47 AM
To: user@tika.apache.org
Subject: Running tika-server as a service

Hi,

I want to ensure tika-server is always running, and continues to after restarts 
etc.

I have hacked together an init script (this being CentOS release 6.6) that 
seems to work (it's running, though I haven't restarted the server yet to test it), 
but it's an ugly way to manage things.

How do you keep tika-server running? A daemon manager like daemon tools?  
Handcrafted init.d/upstart/systemd scripts? Is anyone able to share what they 
use?

Thanks,
Peter


RE: Running tika-server as a service

2015-01-08 Thread Allison, Timothy B.
Doh!  My answer focused on my interests rather than your question.  Sorry.  By 
restart, I now assume you mean system restart…  TIKA-894 should help with that 
if you configure your server container (tomcat?) to automatically start/restart.

From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Thursday, January 08, 2015 8:28 AM
To: user@tika.apache.org
Subject: RE: Running tika-server as a service

Peter,
  I don’t have any immediate solutions, but there are two options in the 
pipeline (probably Tika 1.8):


1)  Lewis John McGibbney on TIKA-894 is going to add a war/webapp.

2)  I plan to open an issue related to TIKA-1330 that will make our current 
jax-rs tika-server more robust to OOM and permanent hangs, i.e. the server 
process will shut itself down if it encounters either of these, and a watcher 
process will restart the server process… as currently happens in the dev 
version of TIKA-1330.

  This is an interest close to my heart, and I look forward to hearing how 
others are handling this.

  Best,

   Tim

From: Peter Bowyer [mailto:pe...@mapledesign.co.uk]
Sent: Thursday, January 08, 2015 6:47 AM
To: user@tika.apache.org<mailto:user@tika.apache.org>
Subject: Running tika-server as a service

Hi,

I want to ensure tika-server is always running, and continues to after restarts 
etc.

I have hacked together an init script (this being CentOS release 6.6) that 
seems to work (it's running, though I haven't restarted the server yet to test it), 
but it's an ugly way to manage things.

How do you keep tika-server running? A daemon manager like daemon tools?  
Handcrafted init.d/upstart/systemd scripts? Is anyone able to share what they 
use?

Thanks,
Peter


JAX-RS: SEVERE Problem with writing the data when parser hits exception?

2015-02-27 Thread Allison, Timothy B.
All,

I recently noticed that I'm getting this message logged when there is an 
exception during parsing:

SEVERE: Problem with writing the data, class 
org.apache.tika.server.TikaResource$5, ContentType: text/html

We didn't get this message with Tika 1.6, but we are getting this with Tika 1.7 
and trunk.
Is this to be expected?

Full stack trace is below.  The test document that triggered this is an 
encrypted PDF document.




WARNING: tika: Text extraction failed
org.apache.tika.exception.TikaException: Unable to extract PDF content
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:150)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:146)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:117)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.server.TikaResource$5.write(TikaResource.java:368)
at org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:164)
at org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1363)
at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:244)
at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:117)
at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:80)
at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
at org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:83)
at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:251)
at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261)
at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:70)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:370)
at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:982)
at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1043)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:865)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:696)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:53)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Unknown Source)
Caused by: java.io.IOException
at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:109)
at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:379)
at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:291)
at org.apache.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:225)
at org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:117)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:460)
at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:385)
at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:344)
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:134)
... 35 more
Caused by: java.util.zip.DataForm

RE: JAX-RS: SEVERE Problem with writing the data when parser hits exception?

2015-02-27 Thread Allison, Timothy B.
Hi Sergey,
  Thank you for responding so quickly.  It seems odd to get a "write exception" 
in addition to the parse exception.  I recently centralized _nearly_ all calls 
to parse and added a custom ExceptionMapper.  We could handle it there, if we 
wanted.
  However, if you're not batting an eye at the warning, I'm happy to ignore the 
logs.  Thank you!

  Best,

   Tim

-Original Message-
From: Sergey Beryozkin [mailto:sberyoz...@gmail.com] 
Sent: Friday, February 27, 2015 10:23 AM
To: user@tika.apache.org
Subject: Re: JAX-RS: SEVERE Problem with writing the data when parser hits 
exception?

Hi Tim,

The problem appears to be happening during the write process, when a 
JAX-RS runtime provider delegates back to the JAX-RS StreamingOutput 
TikaResource implementation.
I'm presuming that is what causes the extra exception report.

Do you think it should not be reported/logged? That can easily be done: 
if the parser throws an exception, it can be propagated 
(wrapped if it is not a RuntimeException), caught with a custom 
exception mapper, and the logging suppressed...

Cheers, Sergey

On 27/02/15 15:05, Allison, Timothy B. wrote:
> All,
>
> I recently noticed that I'm getting this message logged when there is an
> exception during parsing:
>
> SEVERE: Problem with writing the data, class
> org.apache.tika.server.TikaResource$5, ContentType: text/html
>
> We didn't get this message with Tika 1.6, but we are getting this with
> Tika 1.7 and trunk.
>
> Is this to be expected?
>
> Full stack trace is below.  The test document that triggered this is an
> encrypted PDF document.
>
> WARNING: tika: Text extraction failed
>
> org.apache.tika.exception.TikaException: Unable to extract PDF content
>
>  at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:150)
>  at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:146)
>  at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
>  at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:117)
>  at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
>  at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>  at org.apache.tika.server.TikaResource$5.write(TikaResource.java:368)
>  at org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:164)
>  at org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1363)
>  at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:244)
>  at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:117)
>  at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:80)
>  at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
>  at org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:83)
>  at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
>  at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
>  at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:251)
>  at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261)
>  at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:70)
>  at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088)
>  at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024)
>  at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
>  at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
>  at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
>  at org.eclipse.jetty.server.Server.handle(Server.java:370)
>  at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
>  at org.eclipse.jetty.server.Abstrac

RE: Config for Tika Windows Service with Apache Commons Daemon

2015-03-04 Thread Allison, Timothy B.
Somewhere on my todo list is to add the ability to stop tika-server on the 
commandline.   I probably won't get to this for a few months, though.

I agree with Nick's recommendation to contribute to the war, if at all possible.
 
-Original Message-
From: Nick Burch [mailto:apa...@gagravarr.org] 
Sent: Wednesday, March 04, 2015 1:22 AM
To: user@tika.apache.org
Subject: Re: Config for Tika Windows Service with Apache Commons Daemon

On Wed, 4 Mar 2015, Jason wrote:
> I can get Tika started as a service, but I can't determine what to use for
> a stop method.

There isn't really a stop method. As it stands, the Tika Server runs in a 
single process, started from the main method. To close it down, send it 
control+c or a kill signal

> prunsrv.exe //IS//tika-daemon --DisplayName "Tika Daemon" --Classpath
> "C:\Tika Service\tika-server-1.7.jar" --StartClass
> "org.apache.tika.server.TikaServerCli" --StopClass
> "org.apache.tika.server.TikaServerCli" --StartMethod main --StopMethod main
> --Description "Tika Daemon Windows Service" --StartMode java --StopMode java

At first glance, that looks like the settings for starting and stopping a 
service which forks into the background, which isn't quite what you need 
for the Tika Server.

Maybe the fix is for you to help with the "war" mode for Tika (TIKA-894), 
then deploy that the normal way for Tomcat or Jetty?

Nick


RE: Indexing PDF and MS Office files

2015-04-16 Thread Allison, Timothy B.
This sounds like a Tika issue, let's move discussion to that list.

If you are still having problems after you upgrade to Tika 1.8, please at least 
submit the stack traces (if you can) to the Tika jira.  We may be able to find 
a document that triggers that stack trace in govdocs1 or the slice of 
CommonCrawl that Julien Nioche contributed to our eval effort.

Tika is not perfect and it will fail on some files, but we are always working 
to improve it.

Best,

  Tim

-Original Message-
From: Vijaya Narayana Reddy Bhoomi Reddy 
[mailto:vijaya.bhoomire...@whishworks.com] 
Sent: Thursday, April 16, 2015 7:44 AM
To: solr-u...@lucene.apache.org
Subject: Re: Indexing PDF and MS Office files

Thanks Allison.

I tried with the mentioned changes. But still no luck. I am using the code
from lucidworks site provided by Erick and now included the changes
mentioned by you. But still the issue persists with a small percentage of
documents (both PDF and MS Office documents) failing. Unfortunately, these
documents are proprietary and client-confidential and hence I am not sure
whether they can be uploaded into Jira.

These files normally open in Adobe Reader and MS Office tools.

Thanks & Regards
Vijay


On 16 April 2015 at 12:33, Allison, Timothy B.  wrote:

> I entirely agree with Erick -- it is best to isolate Tika in its own jvm
> if you can -- bad things can happen if you don't [1] [2].
>
> Erick's blog on SolrJ is fantastic.  If you want to have Tika parse
> embedded documents/attachments, make sure to set the parser in the
> ParseContext before parsing:
>
> ParseContext context = new ParseContext();
> //add this line:
> context.set(Parser.class, _autoParser);
>  InputStream input = new FileInputStream(file);
>
> Tika 1.8 is soon to be released.  If that doesn't fix your problems,
> please submit stacktraces (and docs, if possible) to the Tika jira, and
> we'll try to make the fixes.
>
> Cheers,
>
> Tim
>
> [1]
> http://events.linuxfoundation.org/sites/events/files/slides/1s_and_0s_1.pdf
> [2]
> http://events.linuxfoundation.org/sites/events/files/slides/TikaEval_ACNA15_allison_herceg_v2.pdf
> -Original Message-
> From: Vijaya Narayana Reddy Bhoomi Reddy [mailto:
> vijaya.bhoomire...@whishworks.com]
> Sent: Thursday, April 16, 2015 7:10 AM
> To: solr-u...@lucene.apache.org
> Subject: Re: Indexing PDF and MS Office files
>
> Erick,
>
> I tried indexing both ways - SolrJ / Tika's AutoParser and as well as
> SolrCell's ExtractRequestHandler. Majority of the PDF and Word documents
> are getting parsed properly and indexed into Solr. However, a minority of
> them keep failing wither PDFParser or OfficeParser error.
>
> Not sure if this behaviour can be modified so that all the documents can be
> indexed. The business requirement we have is to index all the documents.
> However, if a small percentage of them fails, not sure what other ways
> exist to index them.
>
> Any help please?
>
>
> Thanks & Regards
> Vijay
>
>
>
> On 15 April 2015 at 15:20, Erick Erickson  wrote:
>
> > There's quite a discussion here:
> > https://issues.apache.org/jira/browse/SOLR-7137
> >
> > But, I personally am not a huge fan of pushing all the work on to Solr,
> in
> > a
> > production environment the Solr server is responsible for indexing,
> > parsing the
> > docs through Tika, perhaps searching etc. This doesn't scale all that
> well.
> >
> > So an alternative is to use SolrJ with Tika, which is totally independent
> > of
> > what version of Tika is on the Solr server. Here's an example.
> >
> > http://lucidworks.com/blog/indexing-with-solrj/
> >
> > Best,
> > Erick
> >
> > On Wed, Apr 15, 2015 at 4:46 AM, Vijaya Narayana Reddy Bhoomi Reddy
> >  wrote:
> > > Thanks everyone for the responses. Now I am able to index PDF documents
> > > successfully. I have implemented manual extraction using Tika's
> > AutoParser
> > > and PDF functionality is working fine. However,  the error with some MS
> > > office word documents still persist.
> > >
> > > The error message is "java.lang.IllegalArgumentException: This
> paragraph
> > is
> > > not the first one in the table" which will eventually result in
> > "Unexpected
> > > RuntimeException from org.apache.tika.parser.microsoft.OfficeParser"
> > >
> > > Upon some reading, it looks like its a bug with Tika 1.5 and seems to
> > have
> > > been fixed with Tika 1.6 (
> > https://issues.apache.org/jira/browse/TIKA-1251 ).
> > > I am new to Solr / Tika and hence

RE: Indexing PDF and MS Office files

2015-04-16 Thread Allison, Timothy B.
Let's move this to the Tika users' list.  

I'm aware that [1] is quite common in govdocs1, and it might (?) be the source 
of your problem with MSWord files.

If you can share a stack trace, we'll be better able to diagnose.  

Best,

Tim


[1]
org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
unknown compression method
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:65)
at 
o.a.t.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:264)



-Original Message-
From: Vijaya Narayana Reddy Bhoomi Reddy 
[mailto:vijaya.bhoomire...@whishworks.com] 
Sent: Thursday, April 16, 2015 9:17 AM
To: solr-u...@lucene.apache.org
Subject: Re: Indexing PDF and MS Office files

For MS Word documents, one common pattern I noticed across all the failed
documents is that they all contain embedded images within the documents
(like scanned signature images; these documents are much like letterheads
where someone scanned the signature image and then embedded it into the
document along with the text).

For other documents which completed successfully, no images were present.
Just wondering if these are causing the issue.


Thanks & Regards
Vijay



On 16 April 2015 at 12:58, Vijaya Narayana Reddy Bhoomi Reddy <
vijaya.bhoomire...@whishworks.com> wrote:

> Thanks Tim.
>
> I shall raise a Jira with the stack trace information.
>
> Thanks & Regards
> Vijay
>
>
> On 16 April 2015 at 12:54, Allison, Timothy B.  wrote:
>
>> This sounds like a Tika issue, let's move discussion to that list.
>>
>> If you are still having problems after you upgrade to Tika 1.8, please at
>> least submit the stack traces (if you can) to the Tika jira.  We may be
>> able to find a document that triggers that stack trace in govdocs1 or the
>> slice of CommonCrawl that Julien Nioche contributed to our eval effort.
>>
>> Tika is not perfect and it will fail on some files, but we are always
>> working to improve it.
>>
>> Best,
>>
>>   Tim
>>
>> -Original Message-
>> From: Vijaya Narayana Reddy Bhoomi Reddy [mailto:
>> vijaya.bhoomire...@whishworks.com]
>> Sent: Thursday, April 16, 2015 7:44 AM
>> To: solr-u...@lucene.apache.org
>> Subject: Re: Indexing PDF and MS Office files
>>
>> Thanks Allison.
>>
>> I tried with the mentioned changes. But still no luck. I am using the code
>> from lucidworks site provided by Erick and now included the changes
>> mentioned by you. But still the issue persists with a small percentage of
>> documents (both PDF and MS Office documents) failing. Unfortunately, these
>> documents are proprietary and client-confidential and hence I am not sure
>> whether they can be uploaded into Jira.
>>
>> These files normally open in Adobe Reader and MS Office tools.
>>
>> Thanks & Regards
>> Vijay
>>
>>
>> On 16 April 2015 at 12:33, Allison, Timothy B. 
>> wrote:
>>
>> > I entirely agree with Erick -- it is best to isolate Tika in its own jvm
>> > if you can -- bad things can happen if you don't [1] [2].
>> >
>> > Erick's blog on SolrJ is fantastic.  If you want to have Tika parse
>> > embedded documents/attachments, make sure to set the parser in the
>> > ParseContext before parsing:
>> >
>> > ParseContext context = new ParseContext();
>> > //add this line:
>> > context.set(Parser.class, _autoParser);
>> >  InputStream input = new FileInputStream(file);
>> >
>> > Tika 1.8 is soon to be released.  If that doesn't fix your problems,
>> > please submit stacktraces (and docs, if possible) to the Tika jira, and
>> > we'll try to make the fixes.
>> >
>> > Cheers,
>> >
>> > Tim
>> >
>> > [1]
>> >
>> http://events.linuxfoundation.org/sites/events/files/slides/1s_and_0s_1.pdf
>> > [2]
>> >
>> http://events.linuxfoundation.org/sites/events/files/slides/TikaEval_ACNA15_allison_herceg_v2.pdf
>> > -Original Message-
>> > From: Vijaya Narayana Reddy Bhoomi Reddy [mailto:
>> > vijaya.bhoomire...@whishworks.com]
>> > Sent: Thursday, April 16, 2015 7:10 AM
>> > To: solr-u...@lucene.apache.org
>> > Subject: Re: Indexing PDF and MS Office files
>> >
>> > Erick,
>> >
>> > I tried indexing both ways - SolrJ / Tika's AutoParser and as well as
>> > SolrCell's ExtractRequestHandler. Majority of the PDF and Word documents
>> > are getting parsed properly and

FW: TIKA OCR not working

2015-04-27 Thread Allison, Timothy B.
Trung,

I haven't experimented with our OCR parser yet, but this should give a good 
start: https://wiki.apache.org/tika/TikaOCR .

Have you installed tesseract?

Tika colleagues,
  Any other tips?  What else has to be configured and how?
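
For a standalone sanity check outside of Solr, something along these lines should exercise the OCR path from plain Java (a rough sketch; it assumes tesseract is installed and on the PATH, and the TesseractOCRConfig lines are optional knobs rather than required settings):

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.sax.BodyContentHandler;

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

public class OcrSmokeTest {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        ParseContext context = new ParseContext();

        // optional: only needed if tesseract is not on the PATH or you want non-default settings
        TesseractOCRConfig ocrConfig = new TesseractOCRConfig();
        // ocrConfig.setTesseractPath("/usr/local/bin/");   // hypothetical install location
        context.set(TesseractOCRConfig.class, ocrConfig);

        Metadata metadata = new Metadata();
        BodyContentHandler handler = new BodyContentHandler(-1);
        try (InputStream is = Files.newInputStream(Paths.get("tesseract_3.png"))) {
            parser.parse(is, handler, metadata, context);
        }
        System.out.println(metadata);
        System.out.println(handler.toString());
    }
}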

-Original Message-
From: trung.ht [mailto:trung...@anlab.vn] 
Sent: Friday, April 24, 2015 11:22 PM
To: solr-u...@lucene.apache.org
Subject: Re: TIKA OCR not working

HI everyone,

Does anyone have the answer for this problem :)?


I saw the document of Tika. Tika 1.7 support OCR and Solr 5.0 use Tika 1.7,
> but it looks like it does not work. Does anyone know that TIKA OCR works
> automatically with Solr or I have to change some settings?
>
>>
Trung.


> It's not clear if OCR would happen automatically in Solr Cell, or if
>> changes to Solr would be needed.
>>
>> For Tika OCR info, see:
>>
>> https://issues.apache.org/jira/browse/TIKA-93
>> https://wiki.apache.org/tika/TikaOCR
>>
>>
>>
>> -- Jack Krupansky
>>
>> On Thu, Apr 23, 2015 at 9:14 AM, Alexandre Rafalovitch <
>> arafa...@gmail.com>
>> wrote:
>>
>> > I think OCR is in Tika 1.8, so might be in Solr 5.?. But I haven't seen
>> it
>> > in use yet.
>> >
>> > Regards,
>> > Alex
>> > On 23 Apr 2015 10:24 pm, "Ahmet Arslan" 
>> wrote:
>> >
>> > > Hi Trung,
>> > >
>> > > I didn't know about OCR capabilities of tika.
>> > > Someone who is familiar with sold-cell can inform us whether this
>> > > functionality is added to solr or not.
>> > >
>> > > Ahmet
>> > >
>> > >
>> > >
>> > > On Thursday, April 23, 2015 2:06 PM, trung.ht 
>> wrote:
>> > > Hi Ahmet,
>> > >
>> > > I used a png file, not a pdf file. From the document, I understand
>> that
>> > > solr will post the file to tika, and since tika 1.7, OCR is included.
>> Is
>> > > there something I misunderstood.
>> > >
>> > > Trung.
>> > >
>> > >
>> > > On Thu, Apr 23, 2015 at 5:59 PM, Ahmet Arslan
>> > > >
>> > > wrote:
>> > >
>> > > > Hi Trung,
>> > > >
>> > > > solr-cell (tika) does not do OCR. It cannot exact text from image
>> based
>> > > > pdfs.
>> > > >
>> > > > Ahmet
>> > > >
>> > > >
>> > > >
>> > > > On Thursday, April 23, 2015 7:33 AM, trung.ht 
>> > wrote:
>> > > >
>> > > >
>> > > >
>> > > > Hi,
>> > > >
>> > > > I want to use solr to index some scanned document, after settings
>> solr
>> > > > document with a two field "content" and "filename", I tried to
>> upload
>> > the
>> > > > attached file, but it seems that the content of the file is only
>> "\n \n
>> > > > \n".
>> > > > But if I used the tesseract from command line I got the result
>> > correctly.
>> > > >
>> > > > The log when solr receive my request:
>> > > > ---
>> > > > INFO  - 2015-04-23 03:49:25.941;
>> > > > org.apache.solr.update.processor.LogUpdateProcessor; [collection1]
>> > > > webapp=/solr path=/update/extract params={literal.groupid=2&json.nl
>> > > =flat&
>> > > > resource.name=phplNiPrs&literal.id
>> > > >
>> > >
>> >
>> =4&commit=true&extractOnly=false&literal.historyid=4&omitHeader=true&literal.userid=3&literal.createddate=2015-04-22T15:00:00Z&fmap.content=content&wt=json&literal.filename=\\trunght\test\tesseract_3.png}
>> > > >
>> > > > 
>> > > >
>> > > > The document when I check on solr admin page:
>> > > > -
>> > > > { "groupid": 2, "id": "4", "historyid": 4, "userid": 3,
>> "createddate":
>> > > > "2015-04-22T15:00:00Z", "filename":
>> > "trunght\\test\\tesseract_3.png",
>> > > > "autocomplete_text": [ "trunght\\test\\tesseract_3.png" ],
>> > > "content": "
>> > > > \n \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n
>> \n
>> > \n
>> > > > \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n
>> \n
>> > ",
>> > > > "_version_": 1499213034586898400 }
>> > > >
>> > > > ---
>> > > >
>> > > > Since I am a solr newbie I do not know where to look, can anyone
>> give
>> > me
>> > > > an advice for where to look for error or settings to make it work.
>> > > > Thanks in advanced.
>> > > >
>> > > > Trung.
>> > > >
>> > >
>> >
>>
>
>


RE: Odp.: solr issue with pdf forms

2015-04-29 Thread Allison, Timothy B.
I completely agree with Erick about the utility of the TermsComponent to see 
what is actually being indexed.  If you find problems there and if you haven't 
done so already, you might also investigate further down the stack.  It might 
make sense to run the tika-app.jar (whichever version you are using in DIH or 
other mechanism?) or even the pdfbox-app.jar (ExtractText option) on your files 
outside of Solr to see what text/noise you're getting for the files that are 
causing problems.
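
Concretely, something like the following (jar versions are whatever you actually have deployed; the file name is illustrative):

java -jar tika-app-1.8.jar --text problem-file.pdf
java -jar pdfbox-app-1.8.9.jar ExtractText problem-file.pdf problem-file.txt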



-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Tuesday, April 28, 2015 9:07 PM
To: solr-u...@lucene.apache.org
Subject: Re: Odp.: solr issue with pdf forms

There better be.

1> go to the admin UI
2> select a core
3> select "schema browser"
4> select a field from the drop-down

Until you do step 4 the window will be pretty blank.

Here's the info for TermsComponent, what have you tried?

https://cwiki.apache.org/confluence/display/solr/The+Terms+Component

Best,
Erick

On Tue, Apr 28, 2015 at 1:04 PM,   wrote:
> Thanks a lot for being patient with me. Unfortunately there is no button 
> "load term info". :-(
> Can you maybe help me using the TermsComponent instead? I read it is 
> configured by default.
>
> Thanks a lot
> Best
> Steve
>
> -Ursprüngliche Nachricht-
> Von: Erick Erickson [mailto:erickerick...@gmail.com]
> Gesendet: Montag, 27. April 2015 17:23
> An: solr-u...@lucene.apache.org
> Betreff: Re: Odp.: solr issue with pdf forms
>
> We're still not quite there. There should be a "load term info" button on 
> that page. Clicking that button will show you the terms in your index (as 
> opposed to the raw stored input which is what you get when you look at 
> results in the browser). My bet is that you'll see perfectly normal tokens in 
> the index that will NOT have the wonky characters you see in the display.
>
> If that's the case, then you have a browser issue, Solr is working perfectly 
> fine. On the other hand, if the individual terms are weird, then you have 
> something more fundamental going on.
>
> Which is why I mentioned the TermsComponent. That will return indexed tokens, 
> and allows you a bit more flexibility than the admin page in terms of what 
> tokens you see, but it's essentially the same information.
>
> Best,
> Erick
>
> On Sun, Apr 26, 2015 at 11:18 PM,   wrote:
>> Erick,
>>
>> thanks a lot for helping me here. In my case it is the "content" field which 
>> is not displayed correctly. So I went to the schema browser like you pointed 
>> out. Here is the information I found:
>> Field: content
>> Field Type: text
>> Properties:  Indexed, Tokenized, Stored, TermVector Stored
>> Schema:  Indexed, Tokenized, Stored, TermVector Stored
>> Index:  Indexed, Tokenized, Stored, TermVector Stored Copied Into:
>> spell teaser Position Increment Gap:  100 Index Analyzer:
>> org.apache.solr.analysis.TokenizerChain Details Tokenizer Class:
>> org.apache.solr.analysis.WhitespaceTokenizerFactory
>> Filters:
>> org.apache.solr.analysis.WordDelimiterFilterFactory
>> args:{preserveOriginal: 1 splitOnCaseChange: 0 generateNumberParts: 1
>> catenateWords: 1 luceneMatchVersion: LUCENE_36 generateWordParts: 1
>> catenateAll: 0 catenateNumbers: 1 }
>> org.apache.solr.analysis.LowerCaseFilterFactory
>> args:{luceneMatchVersion: LUCENE_36 }
>> org.apache.solr.analysis.SynonymFilterFactory args:{synonyms:
>> german/synonyms.txt expand: true ignoreCase: true luceneMatchVersion:
>> LUCENE_36 }
>> org.apache.solr.analysis.DictionaryCompoundWordTokenFilterFactory
>> args:{maxSubwordSize: 15 onlyLongestMatch: false minSubwordSize: 4
>> minWordSize: 5 dictionary: german/german-common-nouns.txt
>> luceneMatchVersion: LUCENE_36 }
>> org.apache.solr.analysis.StopFilterFactory args:{words:
>> german/stopwords.txt ignoreCase: true enablePositionIncrements: true
>> luceneMatchVersion: LUCENE_36 }
>> org.apache.solr.analysis.GermanNormalizationFilterFactory
>> args:{luceneMatchVersion: LUCENE_36 }
>> org.apache.solr.analysis.SnowballPorterFilterFactory args:{protected:
>> german/protwords.txt language: German2 luceneMatchVersion: LUCENE_36 }
>> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory
>> args:{luceneMatchVersion: LUCENE_36 } Query Analyzer:
>> org.apache.solr.analysis.TokenizerChain Details Tokenizer Class:
>> org.apache.solr.analysis.WhitespaceTokenizerFactory
>> Filters:
>> org.apache.solr.analysis.WordDelimiterFilterFactory
>> args:{preserveOriginal: 1 splitOnCaseChange: 0 generateNumberParts: 1
>> catenateWords: 0 luceneMatchVersion: LUCENE_36 generateWordParts: 1
>> catenateAll: 0 catenateNumbers: 0 }
>> org.apache.solr.analysis.LowerCaseFilterFactory
>> args:{luceneMatchVersion: LUCENE_36 }
>> org.apache.solr.analysis.StopFilterFactory args:{words:
>> german/stopwords.txt ignoreCase: true enablePositionIncrements: true
>> luceneMatchVersion: LUCENE_36 }
>> org.apache.solr.analysis.GermanNormalizationFilterFactory
>> args:{luceneMatchVersion: LUCENE_36 }
>>

RE: Odp.: solr issue with pdf forms

2015-04-30 Thread Allison, Timothy B.
Is that a literal ^ followed by H?  Out of curiosity, is 
"Bitte^Hlegen^HSie^Hdem^HAntrag" indexed as one token, or is it indexed as (I 
guess it depends on your analysis chain...):
 
Bitte
Hlegen
HSie
Hdem
HAntrag

Might want to open an issue on PDFBox's jira.  Some things can be easily fixed; 
sometimes the text within the PDF file is just plain corrupt. :)

Cheers,

Tim

-Original Message-
From: steve.sch...@t-systems.com [mailto:steve.sch...@t-systems.com] 
Sent: Thursday, April 30, 2015 3:03 AM
To: solr-u...@lucene.apache.org
Subject: AW: Odp.: solr issue with pdf forms

Hey, thanks a lot for the hint with pdfbox-app.jar.
For testing purposes I now extracted an affected pdf form and a usual pdf file.
The result is the following:

Usual pdf file:
Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod 
tempor invidunt ut
labore et d

pdf form:
Bitte^Hlegen^HSie^Hdem^HAntrag Kopien aller Einkommensnachweise bei.^HDaz

Best
Steve

-Ursprüngliche Nachricht-----
Von: Allison, Timothy B. [mailto:talli...@mitre.org] 
Gesendet: Mittwoch, 29. April 2015 14:16
An: solr-u...@lucene.apache.org
Cc: user@tika.apache.org
Betreff: RE: Odp.: solr issue with pdf forms

I completely agree with Erick about the utility of the TermsComponent to see 
what is actually being indexed.  If you find problems there and if you haven't 
done so already, you might also investigate further down the stack.  It might 
make sense to run the tika-app.jar (whichever version you are using in DIH or 
other mechanism?) or even the pdfbox-app.jar (ExtractText option) on your files 
outside of Solr to see what text/noise you're getting for the files that are 
causing problems.



-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Tuesday, April 28, 2015 9:07 PM
To: solr-u...@lucene.apache.org
Subject: Re: Odp.: solr issue with pdf forms

There better be.

1> go to the admin UI
2> select a core
3> select "schema browser"
4> select a field from the drop-down

Until you do step 4 the window will be pretty blank.

Here's the info for TermsComponent, what have you tried?

https://cwiki.apache.org/confluence/display/solr/The+Terms+Component

Best,
Erick

On Tue, Apr 28, 2015 at 1:04 PM,   wrote:
> Thanks a lot for being patient with me. Unfortunately there is no 
> button "load term info". :-( Can you maybe help me using the TermsComponent 
> instead? I read it is configured by default.
>
> Thanks a lot
> Best
> Steve
>
> -Ursprüngliche Nachricht-
> Von: Erick Erickson [mailto:erickerick...@gmail.com]
> Gesendet: Montag, 27. April 2015 17:23
> An: solr-u...@lucene.apache.org
> Betreff: Re: Odp.: solr issue with pdf forms
>
> We're still not quite there. There should be a "load term info" button on 
> that page. Clicking that button will show you the terms in your index (as 
> opposed to the raw stored input which is what you get when you look at 
> results in the browser). My bet is that you'll see perfectly normal tokens in 
> the index that will NOT have the wonky characters you see in the display.
>
> If that's the case, then you have a browser issue, Solr is working perfectly 
> fine. On the other hand, if the individual terms are weird, then you have 
> something more fundamental going on.
>
> Which is why I mentioned the TermsComponent. That will return indexed tokens, 
> and allows you a bit more flexibility than the admin page in terms of what 
> tokens you see, but it's essentially the same information.
>
> Best,
> Erick
>
> On Sun, Apr 26, 2015 at 11:18 PM,   wrote:
>> Erick,
>>
>> thanks a lot for helping me here. In my case it is the "content" field which 
>> is not displayed correctly. So I went to the schema browser like you pointed 
>> out. Here is the information I found:
>> Field: content
>> Field Type: text
>> Properties:  Indexed, Tokenized, Stored, TermVector Stored
>> Schema:  Indexed, Tokenized, Stored, TermVector Stored
>> Index:  Indexed, Tokenized, Stored, TermVector Stored Copied Into:
>> spell teaser Position Increment Gap:  100 Index Analyzer:
>> org.apache.solr.analysis.TokenizerChain Details Tokenizer Class:
>> org.apache.solr.analysis.WhitespaceTokenizerFactory
>> Filters:
>> org.apache.solr.analysis.WordDelimiterFilterFactory
>> args:{preserveOriginal: 1 splitOnCaseChange: 0 generateNumberParts: 1
>> catenateWords: 1 luceneMatchVersion: LUCENE_36 generateWordParts: 1
>> catenateAll: 0 catenateNumbers: 1 }
>> org.apache.solr.analysis.LowerCaseFilterFactory
>> args:{luceneMatchVersion: LUCENE_36 } 
>> org.apache.solr.analysis.SynonymFilterFactory args:{synonyms:
>> german/synonyms.txt expand: true ignoreC

RE: extracting text from an "encrypted" pdf

2015-05-12 Thread Allison, Timothy B.

PDF encryption and access permissions are tricky (see, e.g., the discussion and 
links here: https://issues.apache.org/jira/browse/TIKA-1489 ).  There are 
potentially two passwords for a PDF document, the owner password and the user 
password.  Often, the user password is set to the empty string...this allows 
the owner to modify the document but can effectively give "read" access to the 
user.

Aside from encryption, but related to it, a PDF file has various 
AccessPermissions.  Among other permissions, an owner can specify whether or 
not text should be extracted and/or whether or not text should be extracted for 
accessibility.  As of Tika 1.8, you can have Tika respect these permissions by 
sending in an AccessChecker via the ParseContext.
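
Roughly, the wiring looks like this (a sketch against Tika 1.8; here the AccessChecker rides in on the PDFParserConfig placed in the ParseContext, and the boolean controls whether "extraction for accessibility" is still treated as allowed):

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.AccessChecker;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

static String parseRespectingPermissions(java.io.InputStream is) throws Exception {
    PDFParserConfig pdfConfig = new PDFParserConfig();
    // fail the parse rather than extract when the PDF's permissions forbid it;
    // 'true' here means extraction-for-accessibility is still honored
    pdfConfig.setAccessChecker(new AccessChecker(true));

    ParseContext context = new ParseContext();
    context.set(PDFParserConfig.class, pdfConfig);

    BodyContentHandler handler = new BodyContentHandler(-1);
    new AutoDetectParser().parse(is, handler, new Metadata(), context);
    return handler.toString();
}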


1) What is the preferred way to extract text from a 
pdf("-that-can-be-read-in-AcrobatReader")? 

If you only want text from the PDFDocument (not attachments/embedded documents) 
and you are only parsing PDFs, then it might make sense to use pure PDFBox. 
 I haven't checked recently, but I _think_ that Tika may be 
pulling out some text from annotations or maybe AcroFields that PDFTextStripper 
isn't . ..I can look into this if it matters to you. Tika also 
extracts normalized metadata and does a bit more with metadata than if you were 
using the PDFTextStripper.

2) Does the second approach possibly return more than text? Blobs? Binary data?
The second approach will leverage the full power of Tika to extract content 
from embedded documents/attachments.  The first approach will only extract text 
from the outer pdf document.   You can extract binary data (embedded images or 
other embedded files) in Tika by sending in an EmbeddedDocumentExtractor 
instead of the Parser.class.
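
A bare-bones sketch of that second route (class and directory names are illustrative; the extractor sees every embedded stream and, in this sketch, just writes it to disk):

import org.apache.commons.io.IOUtils;
import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaMetadataKeys;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;

class SaveEmbeddedFiles implements EmbeddedDocumentExtractor {
    private int count = 0;

    public boolean shouldParseEmbedded(Metadata metadata) {
        return true;  // offer every attachment/embedded file to parseEmbedded()
    }

    public void parseEmbedded(InputStream stream, ContentHandler handler,
                              Metadata metadata, boolean outputHtml) throws IOException {
        String name = metadata.get(TikaMetadataKeys.RESOURCE_NAME_KEY);
        if (name == null) {
            name = "embedded-" + count++;
        }
        // assumes an "extracted" directory already exists
        try (FileOutputStream out = new FileOutputStream(new File("extracted", name))) {
            IOUtils.copy(stream, out);
        }
    }
}

// ...then, before parsing the container document:
ParseContext context = new ParseContext();
context.set(EmbeddedDocumentExtractor.class, new SaveEmbeddedFiles());
new AutoDetectParser().parse(inputStream, new BodyContentHandler(-1), new Metadata(), context);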





RE: Embedded images in PDF - detect, extract and/or OCR

2015-05-13 Thread Allison, Timothy B.
By default, Tika is configured not to extract embedded images from PDFs because 
in some edge cases, there can be thousands of images in some small PDF files 
(see https://issues.apache.org/jira/browse/TIKA-1294).  Our choice to have the 
default be “don’t extract” was based on the concern that if we made the change, 
devops folks in large document processing pipelines might be surprised by 
memory consumption and far slower parsing.

To configure Tika to extract embedded images, you can configure a 
PDFParserConfig (setExtractInlineImages(true)) and attach that to a 
ParseContext before the parse, or (if you are just using tika-app) you can set 
that value manually in the app jar in o.a.t.parser.pdf.PDFParser.properties.
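
In code, the first option looks roughly like this (a sketch):

PDFParserConfig pdfConfig = new PDFParserConfig();
pdfConfig.setExtractInlineImages(true);
// also hand over repeated copies of the same image, not just one per unique image
pdfConfig.setExtractUniqueInlineImagesOnly(false);

ParseContext context = new ParseContext();
context.set(PDFParserConfig.class, pdfConfig);
// the inline images are then treated like any other embedded documents, so you also
// need an EmbeddedDocumentExtractor in the context if you want the image bytes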

I’m haven’t tested whether our OCR parser will process those embedded images, 
but it should.

Let me know if this helps.

From: Stefan Alder [mailto:twigbra...@gmail.com]
Sent: Wednesday, May 13, 2015 3:04 PM
To: user@tika.apache.org
Subject: Embedded images in PDF - detect, extract and/or OCR

Ultimately I'm trying to (1) determine whether images, particularly, full page 
images, are embedded in a pdf, and (2) extract the images and/or (3) OCR the 
text.

Does tika-app support this?  When I run java -jar tika-app-1.8.jar test.pdf, I 
get all of the meta data, and see  tags but no images.

Running with -z doesn't output any images.




RE: Embedded images in PDF - detect, extract and/or OCR

2015-05-13 Thread Allison, Timothy B.
Hi Stefan,


1)  Right, out of the box, tika-app does not provide information about 
whether an embedded/inline image exists.  It will handle “attached” images as 
all other parsers do out of the box, but not embedded/inline images.

2)  Disabled for inline images, but not for regular attachments.

3)  At this point, no.  One hack is to unzip the app jar and just change 
the values in the properties file and rezip the jar.  On the horizon, I’d like 
to make a common interface for parser configuration so that you can set parser 
config parameters via the regular tika config file, and then you’d be able to 
specify that at the commandline.


If you do change the properties file, you’ll probably also want to change 
extractUniqueInlineImagesOnly
to “false”.
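
For the jar-editing hack, something along these lines should do it (the entry path and property names are from memory, so double-check them against your copy of the jar):

unzip tika-app-1.8.jar org/apache/tika/parser/pdf/PDFParser.properties
# edit the extracted file: extractInlineImages true, extractUniqueInlineImagesOnly false
jar uf tika-app-1.8.jar org/apache/tika/parser/pdf/PDFParser.properties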

Cheers,

  Tim

From: Stefan Alder [mailto:twigbra...@gmail.com]
Sent: Wednesday, May 13, 2015 3:30 PM
To: user@tika.apache.org
Subject: Re: Embedded images in PDF - detect, extract and/or OCR

To clarify,
(1) tika-app, as compiled, does not provide any indication that an image exists 
within a pdf? (my main interest are entire page images for PDFs that were 
scanned). Again, my first interest is detecting whether embedded images exist.
(2) the -z option is effectively disabled for PDFs?
(3) is there a way to enable detection and/or extraction from the command line, 
as opposed to editing the source?



On Wed, May 13, 2015 at 12:18 PM, Allison, Timothy B. 
mailto:talli...@mitre.org>> wrote:
By default, Tika is configured not to extract embedded images from PDFs because 
in some edge cases, there can be thousands of images in some small PDF files 
(see https://issues.apache.org/jira/browse/TIKA-1294).  Our choice to have the 
default be “don’t extract” was based on the concern that if we made the change, 
devops folks in large document processing pipelines might be surprised by 
memory consumption and far slower parsing.

To configure Tika to extract embedded images, you can configure a 
PDFParserConfig (setExtractInlineImages(true)) and attach that to a 
ParseContext before the parse, or (if you are just using tika-app) you can set 
that value manually in the app jar in o.a.t.parser.pdf.PDFParser.properties.

I’m haven’t tested whether our OCR parser will process those embedded images, 
but it should.

Let me know if this helps.

From: Stefan Alder [mailto:twigbra...@gmail.com<mailto:twigbra...@gmail.com>]
Sent: Wednesday, May 13, 2015 3:04 PM
To: user@tika.apache.org<mailto:user@tika.apache.org>
Subject: Embedded images in PDF - detect, extract and/or OCR

Ultimately I'm trying to (1) determine whether images, particularly, full page 
images, are embedded in a pdf, and (2) extract the images and/or (3) OCR the 
text.

Does tika-app support this?  When I run java -jar tika-app-1.8.jar test.pdf, I 
get all of the meta data, and see  tags but no images.

Running with -z doesn't output any images.





RE: Memory issues with PDF parser

2015-06-04 Thread Allison, Timothy B.
Hi Mouthgalya,
  We fixed that NPE in https://issues.apache.org/jira/browse/TIKA-1605, and the 
fix will be available in Tika 1.9, which should be out within a week.
  As for memory issues, we worked around a memory leak in PDFBox with static 
caching of fonts for Tika 1.7 (may have been 1.8), but there may be others.  
One potential memory hog is the processing of inline images within PDFs...have 
you configured Tika to pull those out (default is to skip them)?  Other than 
that, I'd recommend dropping a note to the PDFBox users list to get help in 
diagnosing memory consumption with PDFBox.  Have you tried any memory profiling?

  Best,

Tim

From: Mouthgalya Ganapathy [mailto:mouthgalya.ganapa...@fitchratings.com]
Sent: Wednesday, June 03, 2015 3:25 PM
To: talli...@apache.org
Subject: Memory issues with PDF parser

Hi all,
I am trying to use Apache Tika 1.8 for extracting content from pdf. I have the 
below code for extracting it. It works well for a few files. But if I read many 
files, I see an out-of-memory exception.
I also see a Null pointer exception in the pdf parser. I think the null pointer 
exception is because of the memory exception.
Any suggestions?

Tika version:
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-server</artifactId>
  <version>1.8</version>
</dependency>

I am running it as a part of J2EE APP in JBoss 1.7

Code:-

//Parse the pdf content using Apache Tikka
InputStream is = null;
try {
  is = new BufferedInputStream(new FileInputStream(input));
  //Disable write limit.
  contenthandler = new BodyContentHandler(-1);
   metadata = new Metadata();
  pdfparser = new PDFParser();
  context = new ParseContext();
  pdfparser.parse(is, contenthandler, metadata, context);
  docBody=contenthandler.toString();
  //System.out.println(contenthandler.toString());
}
catch (Exception e) {
   System.out.println("Exception in updating docbody for report ==> 
" + report.getDocID());
   if(is==null)
 System.out.println("The input stream is a null object");
   e.printStackTrace();
  logger.log(Level.SEVERE, e.getMessage(), e);
}
finally {
if (is != null) is.close();
contenthandler=null;
metadata=null;
pdfparser=null;
context =null;
}


Exception:-
I am just including the null pointer exception in the parser below.

10:53:11,696 INFO  [stdout] (Thread-11 (HornetQ-client-global-threads-1619682129)) Exception in updating docbody for report ==> RPT_764268
10:53:12,218 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129)) java.lang.NullPointerException
10:53:12,219 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129)) at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:158)
10:53:12,219 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129)) at com.fitch.researchapi.dao.ResearchReportMDAO.updateDocBody(ResearchReportMDAO.java:881)
10:53:12,219 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129)) at com.fitch.researchapi.dao.ResearchReportMDAO.loadFile_NEW(ResearchReportMDAO.java:965)
10:53:12,220 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129)) at com.fitch.researchapi.dao.ResearchReportMDAO.upsert_NEW(ResearchReportMDAO.java:676)
10:53:12,220 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129)) at com.fitch.research.ejb.ResearchReportManagerBean.processResearchReport(ResearchReportManagerBean.java:70)
10:53:12,221 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129)) at sun.reflect.GeneratedMethodAccessor35.invoke(Unknown Source)
10:53:12,221 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129)) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
10:53:12,222 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129)) at java.lang.reflect.Method.invoke(Method.java:597)
10:53:12,222 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129)) at org.jboss.as.ee.component.ManagedReferenceMethodInterceptorFactory$ManagedReferenceMethodInterceptor.processInvocation(ManagedReferenceMethodInterceptorFactory.java:72)
10:53:12,223 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129)) at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)
10:53:12,223 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129)) at org.jboss.invocation.WeavedInterceptor.processInvocation(WeavedInterceptor.java:53)
10:53:12,223 ERROR [stderr] (Thread-11 (HornetQ-client-global-threads-1619682129)) at org.jboss.as.ee.component.interceptors.UserInterceptorFactory$1.process

RE: Memory issues with PDF parser

2015-06-04 Thread Allison, Timothy B.
1)  Right, the npe is caused by the exception returning null when we call 
getMessage().  In TIKA-1605, we modified all code in the project to check for 
null returned by getMessage().  So, in the "fixed" version, you'll still get 
your good old IOException.  I can't tell from your stacktrace what caused the 
IOException.

2)  Y, regular builds of 1.9's app (and other modules) are available via 
Jenkins here: 
https://builds.apache.org/view/Tika/job/tika-trunk-jdk1.7/org.apache.tika$tika-app/

3)  Ok, makes sense.

For kicks, you may want to change opening the file to:
is = TikaInputStream.get(file)
or maybe:
is = TikaInputStream.get(file, metadata)

And you'll want to surround your closing of the IS in a try/catch block.  Or 
use IOUtils.closeQuietly.
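
Something like this is the shape I have in mind (a sketch only, reusing your variable names; error handling trimmed):

Metadata metadata = new Metadata();
BodyContentHandler contenthandler = new BodyContentHandler(-1);  // no write limit
ParseContext context = new ParseContext();
PDFParser pdfparser = new PDFParser();

InputStream is = null;
try {
    is = TikaInputStream.get(input, metadata);   // 'input' is your java.io.File
    pdfparser.parse(is, contenthandler, metadata, context);
    docBody = contenthandler.toString();
} catch (Exception e) {
    logger.log(Level.SEVERE, "Parse failed for report " + report.getDocID(), e);
} finally {
    IOUtils.closeQuietly(is);   // org.apache.commons.io.IOUtils
}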

Finally, are you able to share the particular file that caused the IOException?
From: Mouthgalya Ganapathy [mailto:mouthgalya.ganapa...@fitchratings.com]
Sent: Thursday, June 04, 2015 10:20 AM
To: Allison, Timothy B.; talli...@apache.org
Cc: user@tika.apache.org; Sauparna Sarkar
Subject: RE: Memory issues with PDF parser

Hi Timothy,
Thanks for the prompt reply.


1.)Wouldn't fixing the null pointer exception in turn throw the IO 
exception? I saw that the null pointer exception was thrown inside the catch 
block of the IO exception? Any root cause for the IO exception??.

Is that also fixed?



I am including the code that threw the null pointer exception in Tika 1.8



Exception:
10:53:12,218 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129)) java.lang.NullPointerException
10:53:12,219 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))at 
org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:158)



Code in the pdf parser:
catch (IOException e) {
//nonseq parser throws IOException for bad password
//At the Tika level, we want the same exception to be thrown
if (e.getMessage().contains("Error (CryptographyException)")) {
metadata.set("pdf:encrypted", Boolean.toString(true));
throw new EncryptedDocumentException(e);
}


2.)Do you have a snapshot or beta version of tika 1.9 that I could try with 
our pdf corpus? It would also help in your developer testing.

3.)For the inline images, we have just set the defaults(which is to skip 
them as you had mentioned). I have not done any memory profiling till now. I 
will also try that.



Thanks,
MG

From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Thursday, June 04, 2015 7:19 AM
To: Mouthgalya Ganapathy; talli...@apache.org<mailto:talli...@apache.org>
Cc: user@tika.apache.org<mailto:user@tika.apache.org>
Subject: RE: Memory issues with PDF parser

Hi Mouthgalya,
  We fixed that NPE in https://issues.apache.org/jira/browse/TIKA-1605, and the 
fix will be available in Tika 1.9, which should be out within a week.
  As for memory issues, we worked around a memory leak in PDFBox with static 
caching of fonts for Tika 1.7 (may have been 1.8), but there may be others.  
One potential memory hog is the processing of inline images within PDFs...have 
you configured Tika to pull those out (default is to skip them)?  Other than 
that, I'd recommend dropping a note to the PDFBox users list to get help in 
diagnosing memory consumption with PDFBox.  Have you tried any memory profiling?

  Best,

Tim

From: Mouthgalya Ganapathy [mailto:mouthgalya.ganapa...@fitchratings.com]
Sent: Wednesday, June 03, 2015 3:25 PM
To: talli...@apache.org<mailto:talli...@apache.org>
Subject: Memory issues with PDF parser

Hi all,
I am trying to use Apache tika 1.8 for extracting contents from pdf. I have the 
below code for extracting it. It works well for few files. But if I read many 
files , I see out of memory exception.
I also see a Null pointer exception in the pdf parser. I think the null pointer 
exception is because of the memory exception.
Any suggestions?

Tika version:
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-server</artifactId>
  <version>1.8</version>
</dependency>

I am running it as a part of J2EE APP in JBoss 1.7

Code:-

//Parse the pdf content using Apache Tikka
InputStream is = null;
try {
  is = new BufferedInputStream(new FileInputStream(input));
  //Disable write limit.
  contenthandler = new BodyContentHandler(-1);
   metadata = new Metadata();
  pdfparser = new PDFParser();
  context = new ParseContext();
  pdfparser.parse(is, contenthandler, metadata, context);
  docBody=contenthandler.toString();
  //System.out.println(contenthandler.toString());
}
catch (Exception e) {
   System.out.println("Exception in updating docbody for report ==> 
" + report.getDocID());
   

RE: Memory issues with PDF parser

2015-06-04 Thread Allison, Timothy B.
You will get the same exception.  If you run the pure Tika app commandline on a 
triggering file, does it at least show you the "caused by" clause that might 
give more information?
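
i.e., something along the lines of (version and file name adjusted to whatever you actually have):

java -jar tika-app-1.9-SNAPSHOT.jar -t /path/to/the-triggering-file.pdf

The full stack trace, including any "Caused by" lines, should land on stderr.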

Other question: Are you sure that you want to avoid parsing attachments?


From: Mouthgalya Ganapathy [mailto:mouthgalya.ganapa...@fitchratings.com]
Sent: Thursday, June 04, 2015 2:55 PM
To: Allison, Timothy B.
Cc: user@tika.apache.org; Sauparna Sarkar
Subject: RE: Memory issues with PDF parser

Thanks for the update Timothy,
I see that Tika 1.9-SNAPSHOT is available in the Maven repo. I am going to try 
that and will use TikaInputStreams. I will update the results.

Given below is the IO exception that I get when I use Autoparser to extract pdf 
contents. I had used Tika 1.6. and pdfbox 1.8.9. I am guessing I will get the 
same/similar exception when I am going to run it with 1.9-SNAPSHOT.

1:27:53,921 WARN  [org.hornetq.core.client.impl.ClientSessionImpl] (Thread-4 (HornetQ-client-global-threads-248507153)) resetting session after failure
[Server:research-etl-server] 21:29:16,314 INFO  [stdout] (Thread-12 (HornetQ-client-global-threads-248507153)) Exception in updating docbody for report ==> RPT_720610
[Server:research-etl-server] 21:29:23,817 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@29fe5969
[Server:research-etl-server] 21:29:23,818 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:250)
[Server:research-etl-server] 21:29:23,818 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
[Server:research-etl-server] 21:29:23,820 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:121)
[Server:research-etl-server] 21:29:23,820 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at com.fitch.researchapi.dao.ResearchReportMDAO.updateDocBody(ResearchReportMDAO.java:888)
[Server:research-etl-server] 21:29:23,820 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at com.fitch.researchapi.dao.ResearchReportMDAO.loadFile_NEW(ResearchReportMDAO.java:983)
[Server:research-etl-server] 21:29:23,821 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at com.fitch.researchapi.dao.ResearchReportMDAO.upsert_NEW(ResearchReportMDAO.java:678)
[Server:research-etl-server] 21:29:23,821 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at com.fitch.research.ejb.ResearchReportManagerBean.processResearchReport(ResearchReportManagerBean.java:70)
[Server:research-etl-server] 21:29:23,822 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at sun.reflect.GeneratedMethodAccessor38.invoke(Unknown Source)
[Server:research-etl-server] 21:29:23,822 WARN  [org.hornetq.core.server.impl.ServerSessionImpl] (hornetq-failure-check-thread) Cleared up resources for session dc692df4-0a50-11e5-8aa3-005056900299
[Server:research-etl-server] 21:29:23,822 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
[Server:research-etl-server] 21:29:23,823 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at java.lang.reflect.Method.invoke(Method.java:597)
[Server:research-etl-server] 21:29:23,823 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at org.jboss.as.ee.component.ManagedReferenceMethodInterceptorFactory$ManagedReferenceMethodInterceptor.processInvocation(ManagedReferenceMethodInterceptorFactory.java:72)
[Server:research-etl-server] 21:29:23,823 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)
[Server:research-etl-server] 21:29:23,824 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at org.jboss.invocation.WeavedInterceptor.processInvocation(WeavedInterceptor.java:53)
[Server:research-etl-server] 21:29:23,824 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at org.jboss.as.ee.component.interceptors.UserInterceptorFactory$1.processInvocation(UserInterceptorFactory.java:36)
[Server:research-etl-server] 21:29:23,824 ERROR [stderr] (Thread-12 (HornetQ-client-global-threads-248507153)) at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)



Thanks,
Mouthgalya Ganapathy
Product Development Team
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Thursday, June 04, 2015 12:50 PM
To: Mouthgalya Ganapathy
Cc: user@tika.apache.org<mailto:user@tika.apache.org>; Sauparna S

xml vs html parser

2015-06-16 Thread Allison, Timothy B.
All,

  On govdocs1, the xml parser's exceptions accounted for nearly a quarter of 
all thrown exceptions at one point (Tika 1.7ish).  Typically, a file was 
mis-identified as xml when in fact it was sgml or some other text based file 
with some markup that wasn't meant to be xml.

  For kicks, I switched  the config to use the HtmlParser for files identified 
as xml.  This got rid of the exceptions, but the content was quite different 
(ballpark 6k files out of 35k files had similarity < 0.95) mostly because of 
elisions "the quick" -> "thequick", and I assume this is across tags...

  So, is there a way to make the XMLParser more lenient?  Or is there a way to 
configure the HtmlParser to add spaces for non-html tags?

  Or, is there a better solution?



 Thank you!



  Best,



 Tim



RE: CSV Parser in Tika

2015-06-19 Thread Allison, Timothy B.
Y, that’s my belief.

As of now, we’re treating them as text files, which can lead to some really 
long, bogus tokens in Lucene/Solr with analyzers that don’t split on commas. ☹

Detection without filename would be difficult.





From: lewis john mcgibbney [mailto:lewi...@apache.org]
Sent: Friday, June 19, 2015 9:59 AM
To: user@tika.apache.org
Subject: CSV Parser in Tika

Hi Folks,
Am I correct in saying that we can't detect CSV in Tika?
We import commons-csv in tika-parsers/pom.xml, however I don't see a csv 
package and registered parser.
Also, when I use the webapp I get the following for a test csv file with 
semicolon ';' separators

Content-Encoding: ISO-8859-1
Content-Length: 217
Content-Type: text/plain; charset=ISO-8859-1
X-Parsed-By: org.apache.tika.parser.DefaultParser
resourceName: test-semicolon.csv
Any comments please?
Thanks
Lewis




RE: xml vs html parser

2015-06-19 Thread Allison, Timothy B.
Jukka,
  Sorry for my delay.

addSpaceBetweenElements  ...exactly what I was looking for.  Thank you.

  I'll send an update after further analysis of the incorrectly identified 
files to see if we can tweak our mimes.

  Cheers,

  Tim

-Original Message-
From: Jukka Zitting [mailto:jukka.zitt...@gmail.com] 
Sent: Tuesday, June 16, 2015 10:26 AM
To: Tika Users
Subject: Re: xml vs html parser

Hi,

2015-06-16 9:28 GMT-04:00 Allison, Timothy B. :
> So, is there a way to make the XMLParser more lenient?

I don't think so. XML is draconian by design.

> Or is there a way to configure the HtmlParser to add spaces for
> non-html tags?

One option that wouldn't require changes in Tika code could be to use
HtmlParser with the IdentityHtmlMapper and process the output using
TextContentHandler with the addSpaceBetweenElements option enabled.
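
In code, that combination could look roughly like this (an untested sketch; "stream" is
the document's InputStream, and the variable names are just illustrative):

ParseContext context = new ParseContext();
context.set(HtmlMapper.class, IdentityHtmlMapper.INSTANCE); // keep all elements, not just the safe subset
StringWriter writer = new StringWriter();
ContentHandler handler =
    new TextContentHandler(new WriteOutContentHandler(writer), true); // true = add space between elements
new HtmlParser().parse(stream, handler, new Metadata(), context);
String text = writer.toString();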

> Or, is there a better solution?

The cleanest alternative would be to come up with a more accurate
detection heuristics to detect SGML.

Are there some common file name patterns, DOCTYPEs or other easily
identifiable bits that could be used to improve the accuracy of type
detection?

Things like the <?xml ...?> header, presence of xmlns attributes, the
.xml file extension, etc. can be used as highly reliable signals for
XML content, so the lack of them coupled with even some fairly weak
SGML detection signals (stuff like upper case element names?) might be
enough to get significant improvements in this area.

BR,

Jukka Zitting


RE: Extract PDF inline images

2015-07-06 Thread Allison, Timothy B.
Hi Andrea,

  The RecursiveParserWrapper, as you found, is only for extracted content and 
metadata.   It was designed to cache metadata and content from embedded 
documents so that you can easily keep those two things together for each 
embedded document.

  To extract the raw bytes from embedded files, try implementing an 
EmbeddedDocumentExtractor and passing that into the ParseContext.  Take a look 
at 
http://svn.apache.org/repos/asf/tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/UnpackerResource.java
 and specifically the inner class MyEmbeddedDocument extractor for an example.  
As another example, look at 
http://svn.apache.org/repos/asf/tika/trunk/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java,
 and specifically the inner class: FileEmbeddedDocumentExtractor


Basically, in ParseEmbedded, just copy the InputStream to a FileOutputStream, 
and you should be good to go.

public boolean shouldParseEmbedded(Metadata metadata) {
    return true;
}

public void parseEmbedded(InputStream inputStream, ContentHandler contentHandler,
        Metadata metadata, boolean outputHtml) throws SAXException, IOException {
    // e.g. copy the raw bytes to a file of your choosing
    // (outputDir here is a java.nio.file.Path you supply; the file name is up to you):
    Files.copy(inputStream, outputDir.resolve(UUID.randomUUID().toString()));
}

  Best,

   Tim

From: Andrea Asta [mailto:asta.and...@gmail.com]
Sent: Monday, July 06, 2015 6:11 AM
To: user@tika.apache.org
Subject: Extract PDF inline images

Hello,
I'm trying to store the inline images from a PDF to a local folder, but can't 
find any valid example. I can only use the RecursiveParserWrapper to get all 
the available metadata, but not the binary image content.
This is my code:

RecursiveParserWrapper parser = new RecursiveParserWrapper(
  new AutoDetectParser(),
  new BasicContentHandlerFactory(HANDLER_TYPE.XML, -1)
);
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
PDFParserConfig config = new PDFParserConfig();
PDFParser p;
config.setExtractInlineImages(true);
config.setExtractUniqueInlineImagesOnly(false);
context.set(org.apache.tika.parser.pdf.PDFParserConfig.class, config);
context.set(org.apache.tika.parser.Parser.class, parser);

InputStream is = PdfRecursiveExample.class.getResourceAsStream("/BA200PDE.PDF");
//parsing the file
ToXMLContentHandler handler = new ToXMLContentHandler(new FileOutputStream(new 
File("out.txt")), "UTF-8");
parser.parse(is, handler, metadata, context);
How can I store each image file to a folder?
Thanks
Andrea


RE: Extract PDF inline images

2015-07-07 Thread Allison, Timothy B.
Andrea,
  I’m about to commit an example (see TIKA-1674).  In about 10 minutes, look 
for org.apache.tika.example.ExtractEmbeddedFiles in the tika-examples module.
  I’m still a bit stumped though on why my example isn’t working recursively.  
It is only pulling out the children of the input document.  Stay tuned to 
TIKA-1674 for follow up on that.

   Best,

  Tim

From: Andrea Asta [mailto:asta.and...@gmail.com]
Sent: Tuesday, July 07, 2015 6:22 AM
To: user@tika.apache.org
Subject: Re: Extract PDF inline images

Hi Tim,
thanks for your response, but I can't find a complete solution.
I've created a class using the same FileEmbeddedDocumentExtractor from TikaCLI, 
and now I'm trying to do a sample main program with a PDF containing some 
images.
This is my code, but no images get stored, and (checking with the debugger) the 
methods of the DocumentExtractor are never called.
Thanks
Andrea

RecursiveParserWrapper parser = new RecursiveParserWrapper(
  new AutoDetectParser(),
  new BasicContentHandlerFactory(HANDLER_TYPE.XML, -1)
);
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();

FileEmbeddedDocumentExtractor extractor = new FileEmbeddedDocumentExtractor();
context.set(FileEmbeddedDocumentExtractor.class, extractor);

PDFParserConfig config = new PDFParserConfig();
config.setExtractInlineImages(true);
config.setExtractUniqueInlineImagesOnly(true);
context.set(org.apache.tika.parser.pdf.PDFParserConfig.class, config);

context.set(org.apache.tika.parser.Parser.class, new AutoDetectParser());

InputStream is = PdfRecursiveExample.class.getResourceAsStream("/my.PDF");
ToXMLContentHandler handler = new ToXMLContentHandler(new FileOutputStream(new 
File("out.txt")), "UTF-8");
parser.parse(is, handler, metadata, context);

2015-07-06 12:59 GMT+02:00 Allison, Timothy B. 
mailto:talli...@mitre.org>>:
Hi Andrea,

  The RecursiveParserWrapper, as you found, is only for extracted content and 
metadata.   It was designed to cache metadata and content from embedded 
documents so that you can easily keep those two things together for each 
embedded document.

  To extract the raw bytes from embedded files, try implementing an 
EmbeddedDocumentExtractor and passing that into the ParseContext.  Take a look 
at 
http://svn.apache.org/repos/asf/tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/UnpackerResource.java
 and specifically the inner class MyEmbeddedDocument extractor for an example.  
As another example, look at 
http://svn.apache.org/repos/asf/tika/trunk/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java,
 and specifically the inner class: FileEmbeddedDocumentExtractor


Basically, in ParseEmbedded, just copy the InputStream to a FileOutputStream, 
and you should be good to go.

public boolean shouldParseEmbedded(Metadata metadata) {
    return true;
}

public void parseEmbedded(InputStream inputStream, ContentHandler contentHandler,
        Metadata metadata, boolean outputHtml) throws SAXException, IOException {
    // e.g. copy the raw bytes to a file of your choosing
    // (outputDir here is a java.nio.file.Path you supply; the file name is up to you):
    Files.copy(inputStream, outputDir.resolve(UUID.randomUUID().toString()));
}

  Best,

   Tim

From: Andrea Asta [mailto:asta.and...@gmail.com<mailto:asta.and...@gmail.com>]
Sent: Monday, July 06, 2015 6:11 AM
To: user@tika.apache.org<mailto:user@tika.apache.org>
Subject: Extract PDF inline images

Hello,
I'm trying to store the inline images from a PDF to a local folder, but can't 
find any valid example. I can only use the RecursiveParserWrapper to get all 
the available metadata, but not the binary image content.
This is my code:

RecursiveParserWrapper parser = new RecursiveParserWrapper(
  new AutoDetectParser(),
  new BasicContentHandlerFactory(HANDLER_TYPE.XML, -1)
);
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
PDFParserConfig config = new PDFParserConfig();
PDFParser p;
config.setExtractInlineImages(true);
config.setExtractUniqueInlineImagesOnly(false);
context.set(org.apache.tika.parser.pdf.PDFParserConfig.class, config);
context.set(org.apache.tika.parser.Parser.class, parser);

InputStream is = PdfRecursiveExample.class.getResourceAsStream("/BA200PDE.PDF");
//parsing the file
ToXMLContentHandler handler = new ToXMLContentHandler(new FileOutputStream(new 
File("out.txt")), "UTF-8");
parser.parse(is, handler, metadata, context);
How can I store each image file to a folder?
Thanks
Andrea



RE: Inconsistent (buggy) behavior when using tika-server

2015-07-14 Thread Allison, Timothy B.
That looks like a bug in TikaUtils.

For whatever reason, when is.available() returns 0, we are then assuming that 
fileUrl is not null.  We need to check to make sure that fileUrl is not null 
before trying to open the file.

if(is.available() == 0 && !"".equals(fileUrl)){
...

return TikaInputStream.get(new URL(fileUrl), metadata);

Would you mind opening a ticket on jira?

All,
  Is there a reason why an inputstream would return 0 for available() but still 
be readable?

Best,

   Tim


From: Malarout, Namrata (398M-Affiliate) [mailto:namrata.malar...@jpl.nasa.gov]
Sent: Tuesday, July 14, 2015 1:35 PM
To: user@tika.apache.org
Subject: Inconsistent (buggy) behavior when using tika-server

Hi Folks,
I am using Tika trunk (1.10-SNAPSHOT) and posting documents there. An example 
would be the following:


curl -T MOD09GA.A2014010.h30v12.005.2014012183944.vegetation_fraction.tif  
http://localhost:9998/meta --header "Accept: application/json"

...

curl -T MOD09GA.A2014010.h30v12.005.2014012183944.vegetation_fraction.tif  
http://localhost:9998/meta --header "Accept: application/rdf+xml"

...

curl -T MOD09GA.A2014010.h30v12.005.2014012183944.vegetation_fraction.tif  
http://localhost:9998/meta --header "Accept: text/csv"



I am using a python script to iterate through all the files in a folder. It 
works for about 50% to 80% of the files. For the rest it gives an error 500. 
When I post a file individually for which it previously failed (using the 
python script) it sometimes works. When done in an ad hoc manner, it works most 
of the time but fails sometimes. At times it is successful for 
application/rdf+xml format but fails for application/json format. The behavior 
is inconsistent.



Here is an example trace of when it does not work as expected [0]

A sample of the data being used can be found here [1]

Any help would be appreciated.



[0] https://paste.apache.org/lbAm



[1] 
https://drive.google.com/file/d/0B6wmo4_-H0P2eWJjdTdtYS1HRGs/view?usp=sharing



Thanks,

Namrata Malarout


robust Tika and Hadoop

2015-07-15 Thread Allison, Timothy B.
All,

  I'd like to fill out our Wiki a bit more on using Tika robustly within 
Hadoop.  I'm aware of Behemoth [0], Nanite [1] and Morphlines [2].  I haven't 
looked carefully into these packages yet.

  Does anyone have any recommendations for specific configurations/design 
patterns that will defend against oom and permanent hangs within Hadoop?

  Thank you!

Best,

  Tim


[0] https://github.com/DigitalPebble/behemoth
[1] 
http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
[2] 
http://blog.cloudera.com/blog/2013/07/morphlines-the-easy-way-to-build-and-integrate-etl-apps-for-apache-hadoop/



RE: robust Tika and Hadoop

2015-07-20 Thread Allison, Timothy B.
Thank you, Ken and Mark.  Will update wiki over the next few days!

From: Ken Krugler [mailto:kkrugler_li...@transpac.com]
Sent: Monday, July 20, 2015 7:21 PM
To: user@tika.apache.org
Subject: RE: robust Tika and Hadoop

Hi Tim,

When we use Tika with Bixo (https://github.com/bixo/bixo/) we wrap it with a 
TikaCallable 
(https://github.com/bixo/bixo/blob/master/src/main/java/bixo/parser/TikaCallable.java)

This lets us orphan the parsing thread if it times out 
(https://github.com/bixo/bixo/blob/master/src/main/java/bixo/parser/SimpleParser.java#L187)

And provides a bit of protection against things like NoSuchMethodErrors that 
can be thrown by Tika if the mime-type detection code tries to use a parser 
that we exclude, in order to keep the Hadoop job jar size to something 
reasonable.

-- Ken



From: Allison, Timothy B.

Sent: July 15, 2015 4:38:56am PDT

To: user@tika.apache.org<mailto:user@tika.apache.org>

Subject: robust Tika and Hadoop

All,

  I'd like to fill out our Wiki a bit more on using Tika robustly within 
Hadoop.  I'm aware of Behemoth [0], Nanite [1] and Morphlines [2].  I haven't 
looked carefully into these packages yet.

  Does anyone have any recommendations for specific configurations/design 
patterns that will defend against oom and permanent hangs within Hadoop?

  Thank you!

Best,

  Tim


[0] https://github.com/DigitalPebble/behemoth
[1] 
http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
[2] 
http://blog.cloudera.com/blog/2013/07/morphlines-the-easy-way-to-build-and-integrate-etl-apps-for-apache-hadoop/

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr












RE: robust Tika and Hadoop

2015-07-21 Thread Allison, Timothy B.
Ken,
  To confirm your strategy: one new Thread for each call to Tika, add timeout 
exception handling, orphan the thread.
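
Something like this, I take it (just a rough sketch to make sure I have the pattern
right -- "stream" is the document's InputStream and the 30-second timeout is arbitrary):

ExecutorService pool = Executors.newCachedThreadPool();
Future<String> result = pool.submit(new Callable<String>() {
    public String call() throws Exception {
        BodyContentHandler handler = new BodyContentHandler(-1);
        new AutoDetectParser().parse(stream, handler, new Metadata(), new ParseContext());
        return handler.toString();
    }
});
try {
    String text = result.get(30, TimeUnit.SECONDS);
} catch (TimeoutException e) {
    result.cancel(true); // best effort; a truly hung parse thread just gets orphaned
} catch (InterruptedException | ExecutionException e) {
    // log and move on to the next document
}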

Out of curiosity, three questions:

1)  If I had more time to read your code, the answer would be 
obvious...sorry.  How are you organizing your ingest?  Are you concatenating 
files into a SequenceFile or doing something else?  Are you processing each 
file in a single map step, or batching files in your mapper?

2)  Somewhat related to the first question, in addition to orphaning the 
parsing thread, are you doing anything else, like setting maximum number of 
tasks per jvm?  Are you configuring max number of retries, etc?

3)  Are you adding the AutoDetectParser to your ParseContext so that you'll 
get content from embedded files?

Thank you, again.

Best,

 Tim

From: Ken Krugler [mailto:kkrugler_li...@transpac.com]
Sent: Monday, July 20, 2015 7:21 PM
To: user@tika.apache.org
Subject: RE: robust Tika and Hadoop

Hi Tim,

When we use Tika with Bixo (https://github.com/bixo/bixo/) we wrap it with a 
TikaCallable 
(https://github.com/bixo/bixo/blob/master/src/main/java/bixo/parser/TikaCallable.java)

This lets us orphan the parsing thread if it times out 
(https://github.com/bixo/bixo/blob/master/src/main/java/bixo/parser/SimpleParser.java#L187)

And provides a bit of protection against things like NoSuchMethodErrors that 
can be thrown by Tika if the mime-type detection code tries to use a parser 
that we exclude, in order to keep the Hadoop job jar size to something 
reasonable.

-- Ken



From: Allison, Timothy B.

Sent: July 15, 2015 4:38:56am PDT

To: user@tika.apache.org<mailto:user@tika.apache.org>

Subject: robust Tika and Hadoop

All,

  I'd like to fill out our Wiki a bit more on using Tika robustly within 
Hadoop.  I'm aware of Behemoth [0], Nanite [1] and Morphlines [2].  I haven't 
looked carefully into these packages yet.

  Does anyone have any recommendations for specific configurations/design 
patterns that will defend against oom and permanent hangs within Hadoop?

  Thank you!

Best,

  Tim


[0] https://github.com/DigitalPebble/behemoth
[1] 
http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
[2] 
http://blog.cloudera.com/blog/2013/07/morphlines-the-easy-way-to-build-and-integrate-etl-apps-for-apache-hadoop/

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr












RE: robust Tika and Hadoop

2015-07-21 Thread Allison, Timothy B.
Thank you, Ken!

From: Ken Krugler [mailto:kkrugler_li...@transpac.com]
Sent: Tuesday, July 21, 2015 10:23 AM
To: user@tika.apache.org
Subject: RE: robust Tika and Hadoop

Hi Tim,

Responses inline below.

-- Ken



From: Allison, Timothy B.

Sent: July 21, 2015 5:29:37am PDT

To: user@tika.apache.org<mailto:user@tika.apache.org>

Subject: RE: robust Tika and Hadoop

Ken,
  To confirm your strategy: one new Thread for each call to Tika, add timeout 
exception handling, orphan the thread.

Correct.



Out of curiosity, three questions:
1)  If I had more time to read your code, the answer would be 
obvious...sorry.  How are you organizing your ingest?  Are you concatenating 
files into a SequenceFile or doing something else?  Are you processing each 
file in a single map step, or batching files in your mapper?

Files are effectively concatenated, as each record (Cascading Tuple, or Hadoop 
KV pair) has the raw bytes plus a bunch of other data (headers returned, etc)

The parse phase is a map operation, so it's batch processing of all files 
successfully downloaded during that fetch loop.


2)  Somewhat related to the first question, in addition to orphaning the 
parsing thread, are you doing anything else, like setting maximum number of 
tasks per jvm?  Are you configuring max number of retries, etc?

If by "tasks per JVM" you mean the # of times we reuse the JVM, then yes - 
otherwise the orphan threads would eventually clog things up.

For retries, typically we don't set it (so defaults to 4), but in practice I'd 
recommend using something like 2 - so you get one retry, and then it fails, 
otherwise you typically fail four times on that error that could never possible 
happen but does.


3)  Are you adding the AutoDetectParser to your ParseContext so that you'll 
get content from embedded files?

No, not typically, as we're usually ignoring archive files. But that's a good 
point, with current versions of Tika we could now more easily handle those. It 
gets a bit tricky, though, as the UID for content is the URL, but now we'd have 
multiple sub-docs that we'd want to index separately.


From: Ken Krugler [mailto:kkrugler_li...@transpac.com<http://transpac.com/>]
Sent: Monday, July 20, 2015 7:21 PM
To: user@tika.apache.org<mailto:user@tika.apache.org>
Subject: RE: robust Tika and Hadoop

Hi Tim,

When we use Tika with Bixo (https://github.com/bixo/bixo/) we wrap it with a 
TikaCallable 
(https://github.com/bixo/bixo/blob/master/src/main/java/bixo/parser/TikaCallable.java)

This lets us orphan the parsing thread if it times out 
(https://github.com/bixo/bixo/blob/master/src/main/java/bixo/parser/SimpleParser.java#L187)

And provides a bit of protection against things like NoSuchMethodErrors that 
can be thrown by Tika if the mime-type detection code tries to use a parser 
that we exclude, in order to keep the Hadoop job jar size to something 
reasonable.

-- Ken



From: Allison, Timothy B.

Sent: July 15, 2015 4:38:56am PDT

To: user@tika.apache.org<mailto:user@tika.apache.org>

Subject: robust Tika and Hadoop

All,

  I'd like to fill out our Wiki a bit more on using Tika robustly within 
Hadoop.  I'm aware of Behemoth [0], Nanite [1] and Morphlines [2].  I haven't 
looked carefully into these packages yet.

  Does anyone have any recommendations for specific configurations/design 
patterns that will defend against oom and permanent hangs within Hadoop?

  Thank you!

Best,

  Tim


[0] https://github.com/DigitalPebble/behemoth
[1] 
http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
[2] 
http://blog.cloudera.com/blog/2013/07/morphlines-the-easy-way-to-build-and-integrate-etl-apps-for-apache-hadoop/

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com<http://www.scaleunlimited.com/>
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr













FW: error Unsupported Media Type : while implementing ContentStreamUpdateRequestExample from the link http://wiki.apache.org/solr/ContentStreamUpdateRequestExample

2015-07-22 Thread Allison, Timothy B.
What happens when you run straight tika-app against that pdf file?

java -jar tika-app.jar Sample.pdf

(grab tika-app from: http://www.apache.org/dyn/closer.cgi/tika/tika-app-1.9.jar)

Do you have all of the tika jars on your classpath/properly configured within 
your Solr setup?

-Original Message-
From: Kathrincolyn [mailto:kathrinco...@yahoo.in] 
Sent: Wednesday, July 22, 2015 5:57 AM
To: tika-...@lucene.apache.org
Subject: Re: error Unsupported Media Type : while implementing 
ContentStreamUpdateRequestExample from the link 
http://wiki.apache.org/solr/ContentStreamUpdateRequestExample

public class SolrExampleTests {
>
>   public static void main(String[] args) {
> try {
>   //Solr cell can also index MS file (2003 version and 2007 version)
> types.
>   String fileName = "c:/Sample.pdf";
>   //this will be unique Id used by Solr to index the file contents.
>   String solrId = "Sample.pdf";
>
>   indexFilesSolrCell(fileName, solrId);
>
> } catch (Exception ex) {
>   System.out.println(ex.toString());
> }
>   }
>
>   /**
>* Method to index all types of files into Solr.
>* @param fileName
>* @param solrId
>* @throws IOException
>* @throws SolrServerException
>*/
>   public static void indexFilesSolrCell(String fileName, String solrId)
> throws IOException, SolrServerException {
>
> String urlString = "http://localhost:8983/solr";
> SolrServer solr = new CommonsHttpSolrServer(urlString);
>
> ContentStreamUpdateRequest up
>   = new ContentStreamUpdateRequest("/update/extract");
>
> up.addFile(new File(fileName));
>
> up.setParam("literal.id", solrId);
> up.setParam("uprefix", "attr_");
> up.setParam("fmap.content", "attr_content");
>
> up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
>
> solr.request(up);
>
> QueryResponse rsp = solr.query(new SolrQuery("*:*"));
>
> System.out.println(rsp);
>   }
> }
>
> Thanks
Ufindthem   



--
View this message in context: 
http://lucene.472066.n3.nabble.com/error-Unsupported-Media-Type-while-implementing-ContentStreamUpdateRequestExample-from-the-link-httpe-tp4169035p4218516.html
Sent from the Apache Tika - Development mailing list archive at Nabble.com.


RE: Charset Encoding

2015-07-30 Thread Allison, Timothy B.
The AutoDetectReader (within TXTParser) runs the encoding detectors in the order 
specified in 
tika-parsers...resources/META-INF/services/o.a.t.detect.EncodingDetector.

The AutoDetectReader picks the first non-null response from detect().

The current order is:
org.apache.tika.parser.html.HtmlEncodingDetector
org.apache.tika.parser.txt.UniversalEncodingDetector
org.apache.tika.parser.txt.Icu4jEncodingDetector

I've had some luck in some situations flipping the order so that Icu4j is run 
before Mozilla's UniversalEncodingDetector.

If that doesn't work,  you can create your own CP1256 detector that 
returns cp1256 all the time and then put that in the services file.

We had someone hit this issue a year or so ago with UTF-8 (where he knew 
absolutely that the files were, no doubt about it, UTF-8).  

We've talked about having an "override" detector, but we haven't implemented 
that yet.
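
If you go the "always cp1256" route, the detector itself is tiny -- something like the
sketch below (the class name is made up; it also has to be listed in your
META-INF/services/org.apache.tika.detect.EncodingDetector file so it gets picked up):

import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import org.apache.tika.detect.EncodingDetector;
import org.apache.tika.metadata.Metadata;

public class Cp1256EncodingDetector implements EncodingDetector {
    public Charset detect(InputStream input, Metadata metadata) throws IOException {
        // unconditionally report cp1256; the AutoDetectReader takes the first non-null answer
        return Charset.forName("windows-1256");
    }
}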



-Original Message-
From: Ben Gould [mailto:ben.go...@inovexcorp.com] 
Sent: Thursday, July 30, 2015 2:34 PM
To: user@tika.apache.org
Subject: Charset Encoding

Hi all,

I'm working on dynamically parsing a large set of Farsi documents 
(mostly txt, pdf, doc and docx), and am having issues when I come across 
text files encoded in CP1256 (an old windows-arabic format).

I'm using the Tika facade to return a Reader implementation (wrapping 
the input in a TikaInputStream) and then tokenizing the Reader using a 
Lucene Analyzer.  However, whenever it hits CP1256 encoded text files, 
it tries to decode them as (Content-Type -> text/plain; 
charset=x-MacCyrillic).  In the input metadata, I do provide the 
following properties:

Content-Encoding: CP1256
Content-Type: text/plain; charset=CP1256
Content-Type-Hint: text/plain; charset=CP1256

Any ideas on how I can force the TXTParser to use CP1256?

Thanks,
-Ben


RE: [VOTE] Apache Tika 1.10 Release Candidate #1

2015-08-03 Thread Allison, Timothy B.
+1, built on Windows and Linux.  Relying on previous tests for 
performance/comparison results.

Thank you, Dave!

-Original Message-
From: David Meikle [mailto:loo...@gmail.com] 
Sent: Sunday, August 02, 2015 3:15 AM
To: d...@tika.apache.org; user@tika.apache.org
Subject: [VOTE] Apache Tika 1.10 Release Candidate #1

Hi Everyone,

A candidate for the Apache Tika 1.10 release is available at:

https://dist.apache.org/repos/dist/dev/tika/

The release candidate is a zip archive of the sources in:

http://svn.apache.org/repos/asf/tika/tags/1.10-rc1/

The SHA1 checksum of the archive is

b1573adcb194e2c09b77eccc3b1edd16bd4ac67d.

In addition, a staged maven repository is available here:

https://repository.apache.org/content/repositories/orgapachetika-1013


Please vote on releasing this package as Apache Tika 1.10.
The vote is open for the next 72 hours and passes if a majority of at least
three +1 Tika PMC votes are cast.

[ ] +1 Release this package as Apache Tika 1.10

[ ] -1 Do not release this package because...

Here is my +1!

Cheers,
Dave


RE: TikaConfig with constructor args

2015-08-27 Thread Allison, Timothy B.
That’s on my todo list (TIKA-1508).  Unfortunately, that doesn’t exist yet.  
I’d recommend for now following the pattern of the PDFParser or the 
TesseractOCRParser.  The config is driven by a properties file.

As soon as my dev laptop becomes unbricked, I’m going to turn to TIKA-1508.  
Given my schedule, I’d hope to have this into tika trunk within the next few 
weeks.


From: Andrea Asta [mailto:asta.and...@gmail.com]
Sent: Thursday, August 27, 2015 4:38 AM
To: user@tika.apache.org
Subject: TikaConfig with constructor args

Hi all,
I've developed a new Parser for my custom file type.
This parser needs some configuration to init an external connections. Is there 
a way to specify the constructor params (or bean properties to set) in the Tika 
xml format?
Thanks
Andrea


RE: Does tika support "HWP"?

2015-09-02 Thread Allison, Timothy B.
Great.  In the meantime, if you could open a JIRA issue and attach some example 
files (including the different versions), it might be helpful for the community 
to take a look.

Thank you!

-Original Message-
From: Mungeol Heo [mailto:mungeol@gmail.com] 
Sent: Tuesday, September 01, 2015 9:02 PM
To: user@tika.apache.org
Subject: Re: Does tika support "HWP"?

Thank you for your reply.
I will try to write a customized parser for HWP files.
And if my code is "pretty enough", I will consider contributing it.
Again, thank you.

On Tue, Sep 1, 2015 at 7:58 PM, Nick Burch  wrote:
> On Tue, 1 Sep 2015, Mungeol Heo wrote:
>>>
>>> java -jar tika-app-1.10.jar --list-supported-types | grep hwp 
>>> application/x-hwp
>
>
> That means the mime type has been defined in some way
>
>>> java -jar tika-app-1.10.jar --detect sample.hwp 
>>> application/x-tika-msoffice
>
>
> That means that the HWP file is based on the OLE2 file format, but 
> that no-one has told Tika about that, so detection isn't working 
> properly. If you could create a new bug in JIRA for this, and upload a 
> very small HWP file (ideally just a few KB), we can get that fixed
>
>> And another thing is, there is no 'application/x-hwp' in the 
>> supported formats list which are mentioned at 
>> 'http://tika.apache.org/1.10/formats.html' page.
>
>
> That means there is no parser available for HWP, and you'd need to 
> write + contribute one
>
>> So, does tika support "HWP"?
>
>
> Depends on your definition of "supports"!
>
> Nick


RE: tesseract issue

2015-09-09 Thread Allison, Timothy B.
You can build from source if you have an interest (and the bandwidth, time and 
disk space) or pull a nightly build if you don’t want to wait for 1.11, for 
example: 
https://builds.apache.org/view/Tika/job/tika-trunk-jdk1.7/849/org.apache.tika$tika-app/

Thank you, Christian!

Best,

Tim

From: Brian Young [mailto:bwyoung.s...@gmail.com]
Sent: Wednesday, September 09, 2015 4:09 PM
To: user@tika.apache.org
Subject: Re: tesseract issue

Ah that is very good- thank you.  Looks like it will be in 1.11.



On Wed, Sep 9, 2015 at 4:00 PM, Christian Wolfe 
mailto:taida...@gmail.com>> wrote:
Brian,

I submitted a patch for this bug that was accepted by the team - 
https://github.com/apache/tika/pull/56

I don't think it has made it to any release version.

On Wed, Sep 9, 2015 at 3:55 PM, Brian Young 
mailto:bwyoung.s...@gmail.com>> wrote:
Hello,

On OS X at least, tesseract and tessdata may not be under a common root.  e.g.:


/opt/local/share/tessdata

/opt/local/bin/tesseract



Unfortunately it looks like TesseractOCRParser does not accommodate this, 
since there is only one configuration value that is used both for finding the binary 
and for setting the TESSDATA_PREFIX environment var.



Now, TESSDATA_PREFIX does not get set if I do not pass in the path on the 
config object.  However, even though tesseract is in my path, it isn't found 
when the ProcessBuilder executes unless I've given it the full path... which of 
course sets the TESSDATA_PREFIX to the wrong thing.



It seems like maybe it would be best to handle these as two separate 
configuration values?  But short of that and a new version of Tika, does anyone 
have any other advice?



Thank you

Brian












RE: RecursiveParser returning ContentHandler

2015-09-22 Thread Allison, Timothy B.
Y, that should be easy enough.  Instead of the metadata list, we can store a 
list of Metadata+Handler pairs; the current “getMetadata()” can be syntactic 
sugar around the new getMetadataAndHandlers().

Please open a ticket and we can discuss there.

Thank you.

Best,

   Tim



From: Andrea Asta [mailto:asta.and...@gmail.com]
Sent: Monday, September 21, 2015 8:00 AM
To: user@tika.apache.org
Subject: RecursiveParser returning ContentHandler

Hi,
I'm trying to build a custom Conversion API using Tika: it will just add 
"something before" and "something after" the Tika parsers.

In this scenario, I would like to build a mechanism that allows a custom object 
to be built from a parsing result. This can be done easily by working 
with a custom ContentHandler "transformer", but how can I achieve this 
using a RecursiveParserWrapper? In that case I can only set a 
ContentHandlerFactory, and the parser will just call the toString method and set 
the result as metadata, is that right? Can we imagine something to get the entire 
ContentHandler object for each subfile instead of just the result of the toString 
method?

Thanks
Andrea


RE: Maximizing performance when parsing a lot of files

2015-09-25 Thread Allison, Timothy B.
It's best to keep Tika in its own jvm.

If you are working filesystem to filesystem... The simplest thing to do would 
be to call tika-batch via the commandline of tika-app every so often.  By 
default, tika-batch will skip files that it has already processed if you run it 
again, but you will pay the small performance cost of crawling the entire 
directory with each run and checking whether there is an output file for each 
input file.
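
For example, something like (paths are illustrative; use whatever tika-app version you have on hand):

java -jar tika-app-1.10.jar -i /path/to/incoming_docs -o /path/to/extracted_text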

If you think this is a common enough use case, and I do, I'm wondering if it 
would make sense for us to experiment with adding a WatchService to 
tika-batch...Scratch that...probably wouldn't scale ("This API is not designed 
for indexing a hard drive. Most file system implementations have native support 
for file change notification."[0]).  I'm wondering if we could have the crawler 
automatically rerun from the start directory until the user tells tika-batch to 
stop or unless there have been no new files processed in X minutes.
 
If you are going db to db...that's another area for growth in tika-batch.

Finally, the real "big data" solution is probably to go with Spark and friends.

[0] https://docs.oracle.com/javase/tutorial/essential/io/notification.html
-Original Message-
From: zahlenm...@gmx.de [mailto:zahlenm...@gmx.de] 
Sent: Friday, September 25, 2015 7:33 AM
To: user@tika.apache.org
Subject: Maximizing performance when parsing a lot of files

So I have thousands of files to be run through Tika. Unfortunately, these are not 
available at once but are "created" one by one. My tests have shown that the 
creator process is faster than Tika. So now I am wondering how I should combine 
the creator and parser processes to speed things up.
Btw. the creator is completely separate, otherwise I would include the parser 
calls directly in it. But this is not possible.
To achieve some kind of parallelism I thought of two options:
1) Spawn a new small Java code piece which parses a file
2) Send the file to the Tika JAX-RS server
But since the creator is so fast it would fire up multiple calls to Tika per 
second. On the other hand I don't want to wait for the creator to finish, 
because it runs for hours and in the meantime I could already start parsing.
Any ideas?


RE: Tika unable to extract PDF Text

2015-10-14 Thread Allison, Timothy B.
File works with Tika trunk.  What's on your classpath: tika-app or just 
tika-core?  Is there a chance that you don't have tika-parsers on your cp?


-Original Message-
From: Adam Retter [mailto:adam.ret...@googlemail.com] 
Sent: Wednesday, October 14, 2015 12:14 PM
To: user@tika.apache.org
Subject: Tika unable to extract PDF Text

I have a PDF which was created using Apache PDFBox 2.0.0-SNAPSHOT.
Unfortunately, Tika 1.10 seems unable to extract any text from the PDF; I don't 
get any exceptions or errors. The code is as simple as:

new Tika().parseToString(new FileInputStream(f))

Tika is always returning just the empty string.

The PDF is available here - http://static.adamretter.org.uk/adam-1.pdf

Any ideas?

--
Adam Retter

skype: adam.retter
tweet: adamretter
http://www.adamretter.org.uk


RE: Questions about using the Tika

2015-10-21 Thread Allison, Timothy B.
Bouncing to user@tika...

If the PDFs have fixed fields (AcroForm), then that should be easy enough to 
parse out of the xhtml that Tika produces, or you could go with straight PDFBox.
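
For the xhtml route, something like this gets you the full structured output to
post-process (a sketch only; "stream" is the PDF's InputStream):

ToXMLContentHandler handler = new ToXMLContentHandler();
new AutoDetectParser().parse(stream, handler, new Metadata(), new ParseContext());
String xhtml = handler.toString(); // AcroForm field contents, if any, appear in the xhtml body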

If (as I suspect), these are free text resumes, then Tika can help pull out the 
text, but then you're on your own and off into the land of natural language 
processing (or some great regexes) to do the slot filling that you're looking 
for.

Oh, wait, don't forget that there's a chance that you might find useful 
information in the metadata of the PDF: author, company etc., but I have no 
idea how reliable that would be.

-Original Message-
From: Cao, Renzhi (MU-Student) [mailto:rc...@mail.missouri.edu] 
Sent: Wednesday, October 21, 2015 8:45 AM
To: Mattmann, Chris A (3980) ; 
dev-ow...@tika.apache.org
Cc: d...@tika.apache.org
Subject: Re: Questions about using the Tika

Dear all,
 I am interested in parsing the information (like name, skill, location and 
etc) from the PDF resume, and I see that it seems Tika can do that. Could you 
please let me know if it is possible or any example of how to use Tika to parse 
the resume? Thank you very much for your help!

Renzhi Cao
Graduate Research Assistant
Department of Computer Science
University of Missouri-Columbia
Columbia, MO 65211
Cell: 573-825-8874
Email : rc...@mail.missouri.edu
http://web.missouri.edu/~rcrg4/


From: Mattmann, Chris A (3980) 
Sent: Wednesday, October 21, 2015 12:14 AM
To: Cao, Renzhi (MU-Student); dev-ow...@tika.apache.org
Subject: Re: Questions about using the Tika

Please subscribe by sending email to dev-subscr...@tika.apache.org and then 
once you are subscribed post the below to d...@tika.apache.org.

Cheers!

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion 
Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department University of Southern 
California, Los Angeles, CA 90089 USA
++





-Original Message-
From: "Cao, Renzhi (MU-Student)" 
Date: Tuesday, October 20, 2015 at 9:45 PM
To: "dev-ow...@tika.apache.org" 
Subject: Questions about using the Tika

>Dear editor of Tika project,
> I am interested in parsing the information (like name, skill, 
>location and etc) from the PDF resume, and I see that it seems Tika can 
>do that. Could you please let me know if it is possible or any example 
>of how to use Tika to parse the resume? Thank  you very much for your 
>help!
>
>
>
>
>
>
>Renzhi Cao
>Graduate Research Assistant
>Department of Computer Science
>University of Missouri-Columbia
>Columbia, MO 65211
>Cell: 573-825-8874
>Email : rc...@mail.missouri.edu
>http://web.missouri.edu/~rcrg4/
>
>
>
>


RE: [VOTE] Apache Tika 1.11 Release Candidate #1

2015-10-21 Thread Allison, Timothy B.
+0 (some regressions in ppt content)

I just finished the batch comparison run on  ~1.8 million files in our govdocs1 
and commoncrawl corpora comparing Tika 1.10 to 1.11-rc1.  As a caveat, the eval 
code is still in development and there may be bugs in the reports.

Results are here: 
https://github.com/tballison/share/blob/master/tika_comparisons/tika_1_10_vs_1_11-rc1.zip
 

Key reports:
contents/content_diffs.csv (file had one corrupt row when viewing in 
Excel...manually deleted offending content)
exceptions/newExceptionsInBByMimeTypeByStackTrace.csv (small handful)
exceptions/fixedExceptionsInBByMimeType.csv  (none!)
mimes/mime_diffs_A_to_B.csv

On the positive side:
From "mime_diffs_A_to_B.csv", it looks like we are catching more pdfs as pdfs 
(rather than text/xhtml) than we were...great!  We're identifying more files as images 
(jpeg, pict) than as xhtml, and, from a quick look, this appears to be an 
improvement.  We have at least 9 new x-hwp-v5 (great!).

On the negative side:

1) We have a few regressions in ppt exceptions (six of the same aioobe).
2) We have regressions in ppt content (it looks like we're not adding a new 
line/word break where we need to).  The regressions are small per file, but 
they affect ~220 ppts out of ~1500 (~15%). 

Other than the regressions in ppt content, I'd be +1, but I don't think this is 
severe enough to warrant a re-spin.  Happy to look into a fix, though, if we 
want a re-spin...and even if we don't, I'll start looking into this asap.

-Original Message-
From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov] 
Sent: Monday, October 19, 2015 10:23 AM
To: d...@tika.apache.org
Cc: user@tika.apache.org
Subject: [VOTE] Apache Tika 1.11 Release Candidate #1

Hi Folks,

A first candidate for the Tika 1.11 release is available at:

  https://dist.apache.org/repos/dist/dev/tika/

The release candidate is a zip archive of the sources in:
  http://svn.apache.org/repos/asf/tika/tags/1.11-rc1/

The SHA1 checksum of the archive is
d0dde7b3a4f1a2fb6ccd741552ea180dddab630a

In addition, a staged maven repository is available here:

https://repository.apache.org/content/repositories/orgapachetika-1014/


Please vote on releasing this package as Apache Tika 1.11.
The vote is open for the next 72 hours and passes if a majority of at least 
three +1 Tika PMC votes are cast.

[ ] +1 Release this package as Apache Tika 1.11 [ ] -1 Do not release this 
package because…

Cheers,
Chris

P.S. Of course here is my +1.



++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion 
Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department University of Southern 
California, Los Angeles, CA 90089 USA
++





RE: Issues with extraction content of PDF files

2015-12-18 Thread Allison, Timothy B.
Hi Edwin,
  Thank you for reaching out to Tika.  As I mentioned [0], the issue appears to 
be that the pdf file doesn’t contain Unicode mappings for the characters in the 
document.  This means that PDFBox has no way of converting character codes 
within the PDF into anything useful.  I checked with pdftotext, and it also 
didn’t pull out anything useful.
   I’m not a PDF expert, and you may want to drop a note to the PDFBox users 
list to see if someone there might have a workaround/solution.

   Best,

   Tim


[0] 
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201512.mbox/%3cby2pr09mb11297223e13e266cfb2a5ffc7...@by2pr09mb112.namprd09.prod.outlook.com%3E

From: Zheng Lin Edwin Yeo [mailto:edwinye...@gmail.com]
Sent: Friday, December 18, 2015 4:44 AM
To: user@tika.apache.org
Subject: Issues with extraction content of PDF files

Hi,

I'm indexing some PDF documents in Solr. However, for certain PDF files, there 
is Chinese text in the documents, but after indexing, what is indexed in the 
content is either a series of "??" or empty content.

What could be the reason that causes this?

I've shared one of the file with the issue on dropbox, which you can access via 
the link here: 
https://www.dropbox.com/s/rufi9esmnsmzhmw/Desmophen%2B670%2BBAe.pdf?dl=0


Regards,
Edwin


RE: Questions about using AutoDetect and DigestParser

2016-01-05 Thread Allison, Timothy B.
>>Question1) Shouldn't this be more specific? Like PdfParser, 
>>OpenDocumentParser and so on.

Y, make sure to call metadata.getValues("X-Parsed-By"), which returns an array of 
values, and then iterate through that array to see the parsers that actually 
processed your doc.  If you call metadata.get(Property p), you only get the 
first value in the array.
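
For example, after the parse:

for (String parsedBy : metadata.getValues("X-Parsed-By")) {
    System.out.println("parsed by: " + parsedBy); // e.g. DefaultParser, then the format-specific parser
}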

>> Question2) I understand that there is the DigestingParser to add Md5 and 
>> Sha1 hashes to the metadata. But how can I "combine" the AutoDetectParser 
>> and the DigestingParser?

See DigestingParserTest [0] for exact code, but basically something like this:

Metadata m = new Metadata();
CommonsDigester.DigestAlgorithm[] algos = CommonsDigester.parse("md5,sha512");
Parser d = new DigestingParser(new AutoDetectParser(), new CommonsDigester(100, algos));

// inputStream is your document's stream; handler is any ContentHandler, e.g. new WriteOutContentHandler(-1)
d.parse(inputStream, handler, m, new ParseContext());



[0] 
http://svn.apache.org/viewvc/tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/DigestingParserTest.java?view=markup
-Original Message-
From: zahlenm...@gmx.de [mailto:zahlenm...@gmx.de] 
Sent: Tuesday, January 05, 2016 3:33 AM
To: user@tika.apache.org
Subject: Questions about using AutoDetect and DigestParser

Happy New Year everyone,
I have a small program for simple text and metadata extraction. It is really 
not more than this (in Scala):

val fileParser : AutoDetectParser = new AutoDetectParser()
val handler : WriteOutContentHandler = new WriteOutContentHandler(-1)
val metadata : Metadata = new Metadata()
val context : ParseContext = new ParseContext()

try {
fileParser.parse(stream, handler, metadata, context)
} catch ...

When I look at the metadata I always have this line: X-Parsed-By: 
org.apache.tika.parser.DefaultParser
Question1) Shouldn't this be more specific? Like PdfParser, OpenDocumentParser 
and so on.

Question2) I understand that there is the DigestingParser to add Md5 and Sha1 
hashes to the metadata. But how can I "combine" the AutoDetectParser and the 
DigestingParser?

Thanks so far
Kind regards


RE: Questions about using AutoDetect and DigestParser

2016-01-08 Thread Allison, Timothy B.
Sorry I couldn't help.  Please do let us know if you figure out what's going on.

Best,

 Tim

-Original Message-
From: zahlenm...@gmx.de [mailto:zahlenm...@gmx.de] 
Sent: Friday, January 08, 2016 3:43 AM
To: user@tika.apache.org
Subject: Re: Questions about using AutoDetect and DigestParser

Actually I think the test code is quite good for getting an understanding of how the 
DigestingParser works.  I tried every combination I could think of, but I 
couldn't make it work. The code mirrors the unit test as close as possible 
(only the input stream is different). As it seems it is related to my use of 
Scala. If I find the time I will try it again with Java to further pinpoint the 
problem. In the meantime I think I'll stick to java.security.MessageDigest.

Kind regards

-Original Message-
Sent: Thursday, 07 January 2016 um 18:49:09 Uhr
From: "Allison, Timothy B." 
To: "user@tika.apache.org" 
Subject: RE: Questions about using AutoDetect and DigestParser

As for 1, y, sorry, that's a bug I've been meaning to fix... 

As for 2, you're right, the test code is fairly opaque.  Sorry.  The code below 
works when I put it in DigestingParserTest.

The behavior you're seeing with AutoDetectParser() happens when the 
AutoDetectParser fails to load parsers either via the config file or via SPI, 
which reads parsers to load from the Parser class' service file.  Is there any 
reason to think you're getting different SPI behavior with, say (= I don't know 
Scala, and I'm guessing...sorry)

val fileParser : Parser = new AutoDetectParser()

vs.

val fileParser : Parser = new DigestingParser(new AutoDetectParser(), digester)


I'm sure you've tried the following for kicks...(again, apologies for guessing)
val autoParser : AutoDetectParser = new AutoDetectParser()
val fileParser : DigestingParser = new DigestingParser(autoParser, 
digester)


Java unit test that works within DigestingParserTest:

@Test
public void testSimple() throws Exception {
CommonsDigester.DigestAlgorithm[] algos = 
CommonsDigester.parse("md5,sha256,sha384,sha512");
Metadata metadata = new Metadata();
Parser d = new DigestingParser(new AutoDetectParser(), new 
CommonsDigester(UNLIMITED, algos));
ContentHandler handler = new WriteOutContentHandler(-1);
try (InputStream input = 
DigestingParserTest.class.getResourceAsStream("/test-documents/testPDF.pdf")) {
d.parse(input, handler, metadata, new ParseContext());
}

String[] parsedBy = metadata.getValues("X-Parsed-By");
for (String v : parsedBy) {
System.out.println("Parsed by: " + v);
}

assertEquals("org.apache.tika.parser.DefaultParser", parsedBy[0]);
assertEquals("org.apache.tika.parser.pdf.PDFParser", parsedBy[1]);
}


Re: [DISCUSS] Tika 1.12-rc1 (was Re: New Tika release)

2016-01-25 Thread Allison, Timothy B.
Dunno where you are on this...I'm still snowed in.  It would be great if we 
could upgrade to PDFBox 1.8.11 if we haven't done so yet.  TIKA-1830.  Last I 
tried, we have to remove some "exceptional" handling in the unit test comparing 
the sequential to the non-sequential parser because the tests now pass.  Other 
than that, should be straightforward.  I raise this only because Uwe Schindler 
noted how important this improvement is for Solr running on Java 9.

If I had time, I'd also want to finish the upgrade to POI and then run the 
massive corpus tests.  Maybe tomorrow, but not today...argh...

Cheers,

  Tim


From: Markus Jelsma 
Sent: Thursday, January 21, 2016 3:41 PM
To: user@tika.apache.org
Subject: RE: [DISCUSS] Tika 1.12-rc1 (was Re: New Tika release)

Chris - that would be awesome! Nutch 1.12 can then bundle Tika 1.12!
Markus


-Original message-
> From:Mattmann, Chris A (3980) 
> Sent: Thursday 21st January 2016 21:30
> To: user@tika.apache.org
> Subject: [DISCUSS] Tika 1.12-rc1 (was Re: New Tika release)
>
> Fine by me. I can cut a 1.12-rc1 this weekend.
>
> If I don’t hear objections from the other devs, I’ll go for it
> on Friday. Also this will be the first Git release, so should
> be fun! :)
>
> Cheers,
> Chris
>
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++
>
> -Original Message-
> From: Markus Jelsma 
> Reply-To: "user@tika.apache.org" 
> Date: Thursday, January 21, 2016 at 12:27 PM
> To: "user@tika.apache.org" 
> Subject: New Tika release
>
> >Hello PMC,
> >
> >With TIKA-1835 committed Apache Nutch can finally fully support text and
> >link extraction via Boilerpipe, something many Nutch users (myself not
> >included) have been looking forward to for the last few years. We, as
> >Nutch PMC, cannot release Nutch with that support without Tika so our
> >users must wait until this is resolved and available. I do not want to
> >put additional burden to a Tika release manager or whatever, but i do
> >want to kindly beg the Tika PMC to discuss a possible early release of a
> >new Apache Tika.
> >
> >Please let me know what you think.
> >
> >Regards,
> >Markus
>
>


RE: Using Tika that comes with Solr 5.2

2016-02-03 Thread Allison, Timothy B.
The problem (I think) is that tika-parsers.jar includes just the Tika parsers 
(wrappers) around a boatload of actual parsers/dependencies (POI, PDFBox, etc). 
 If you are using jars, I’d recommend the tika-app.jar which includes all 
dependencies.
From: Steven White [mailto:swhite4...@gmail.com]
Sent: Tuesday, February 02, 2016 7:01 PM
To: user@tika.apache.org
Subject: Using Tika that comes with Solr 5.2

Hi everyone,

I have written a standalone application that works with Solr 5.2.  I'm using 
the existing JARs that come with Solr to index data off a file system.  My 
application scans the file system looking for files, then uses Tika to 
extract the raw text, and then sends the raw text to Solr, using SolrJ, for 
indexing.

What I'm finding is that Tika will not extract the raw text from PDF, 
PowerPoint, etc. files, but it will from plain text files.

Here is the code:

public static void parseWithTika() throws Exception {
  File file = new File("C:\\temp\\test.pdf");

  FileInputStream in = new FileInputStream(file);
  AutoDetectParser parser = new AutoDetectParser();
  Metadata metadata = new Metadata();
  BodyContentHandler contentHandler = new BodyContentHandler();

  parser.parse(in, contentHandler, metadata);

  String content = contentHandler.toString();  // <=== 'content' is always an empty string

  in.close();
}

In the above code, 'content' is always empty (the code above is based on 
https://tika.apache.org/1.8/examples.html).

Solr 5.2 comes with the following Tika JARs which I have included all of them: 
tika-core-1.7.jar, tika-java7-1.7.jar, tika-parsers-1.7.jar, tika-xmp-1.7.jar, 
vorbis-java-tika-0.6.jar, kite-morphlines-tika-core-0.12.1.jar and 
kite-morphlines-tika-decompress-0.12.1.jar

Any idea why this isn't working?

Thanks!

Steve


RE: Preventing OutOfMemory exception

2016-02-08 Thread Allison, Timothy B.
I’m not sure why you’d want to append document contents across documents into 
one handler.  Typically, you’d use a new ContentHandler and new Metadata object 
for each parse.  Calling “toString()” does not clear the content handler, and 
you should have 20 copies of the extracted content on your final loop.

There shouldn’t be any difference across file types in the fact that you are 
appending a new copy of the extracted text with each loop.  You might not be 
seeing the memory growth if your other file types aren’t big enough and if you 
are only doing 20 loops.

But the larger question…what are you trying to accomplish?

From: Steven White [mailto:swhite4...@gmail.com]
Sent: Monday, February 08, 2016 1:38 PM
To: user@tika.apache.org
Subject: Preventing OutOfMemory exception

Hi everyone,

I'm integrating Tika with my application and need your help to figure out if 
the OOM I'm getting is due to the way I'm using Tika or if it is an issue with 
parsing XML files.

The following example code is causing OOM on the 7th iteration with -Xmx2g.  The 
test will pass with -Xmx4g.  The XML file I'm trying to parse is 51mb in size.  
I do not see this issue with other file types that I tested so far.  Memory 
usage keeps on growing with XML file types, but stays constant with other file 
types.

public class Extractor {
    private BodyContentHandler contentHandler = new BodyContentHandler(-1);
    private AutoDetectParser parser = new AutoDetectParser();
    private Metadata metadata = new Metadata();

    public String extract(File file) throws Exception {
        TikaInputStream stream = TikaInputStream.get(file);
        try {
            parser.parse(stream, contentHandler, metadata);
            return contentHandler.toString();
        } finally {
            stream.close();
        }
    }
}

public static void main(...) {
    Extractor extractor = new Extractor();
    File file = new File("C:\\temp\\test.xml");
    for (int i = 0; i < 20; i++) {
        extractor.extract(file);
    }
}

Any idea if this is an issue with XML files or if the issue in my code?

Thanks

Steve



RE: Preventing OutOfMemory exception

2016-02-08 Thread Allison, Timothy B.
In your actual code, are you using one BodyContentHandler for all of your 
files?  Or are you creating a new BodyContentHandler for each file?  If the 
former, then, y, there’s a problem with your code; if the latter, that’s not 
something I’ve seen before.

From: Steven White [mailto:swhite4...@gmail.com]
Sent: Monday, February 08, 2016 4:56 PM
To: user@tika.apache.org
Subject: Re: Preventing OutOfMemory exception

Hi Tim,

The code I showed is a minimal example to show the issue I'm running into, 
which is: memory keeps on growing.

In production, the loop that you see will read files off a file system and 
parse them using logic close to what I showed.  I use 
contentHandler.toString() to get back the raw text so I can save it.  Even if I 
get rid of that call, I run into OOM.

Note that, if I test the exact same code against PDF or PPT or ODP or RTF (I 
still have far more formats to test) I do *NOT* see the OOM issue even when I 
increase the loop to 1000 -- memory usage remains steady and stable.  This is 
why in my original email I asked if there is an issue with XML files or with my 
code, such as whether I'm failing to close / release something.

Here is the full call stack when I get the OOM:

  Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.lang.StringBuffer.ensureCapacityImpl(StringBuffer.java:338)
at java.lang.StringBuffer.append(StringBuffer.java:114)
at java.io.StringWriter.write(StringWriter.java:106)
at org.apache.tika.sax.ToTextContentHandler.characters(ToTextContentHandler.java:93)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:136)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.TextContentHandler.characters(TextContentHandler.java:55)
at org.apache.tika.sax.TeeContentHandler.characters(TeeContentHandler.java:102)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.xerces.parsers.AbstractSAXParser.characters(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
at javax.xml.parsers.SAXParser.parse(Unknown Source)
at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:72)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:136)

Thanks

Steve


On Mon, Feb 8, 2016 at 3:07 PM, Allison, Timothy B. 
<talli...@mitre.org> wrote:
I’m not sure why you’d want to append document contents across documents into 
one handler.  Typically, you’d use a new ContentHandler and new Metadata object 
for each parse.  Calling “toString()” does not clear the content handler, and 
you should have 20 copies of the extracted content on your final loop.

There shouldn’t be any difference across file types in the fact that you are 
appending a new copy of the extracted text with each loop.  You might not be 
seeing the memory growth if your other file types aren’t big enough and if you 
are only doing 20 loops.

But the larger question…what are you trying to accomplish?

From: Steven White [mailto:swhite4...@gmail.com]

RE: Preventing OutOfMemory exception

2016-02-09 Thread Allison, Timothy B.
Same parser is ok to reuse…should even be ok in multithreaded applications.

Do not reuse ContentHandler or Metadata objects.
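
A rough sketch of that with one parser shared across worker threads (the thread count and file names below are placeholders):

import java.io.File;
import java.io.InputStream;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class SharedParserExample {
    // One parser instance shared by all worker threads.
    private static final AutoDetectParser PARSER = new AutoDetectParser();

    public static void main(String[] args) {
        List<File> files = Arrays.asList(new File("a.pdf"), new File("b.xml")); // placeholder inputs
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (File file : files) {
            pool.submit(() -> {
                // New handler and metadata per parse -- never reused.
                BodyContentHandler handler = new BodyContentHandler(-1);
                Metadata metadata = new Metadata();
                try (InputStream stream = TikaInputStream.get(file)) {
                    PARSER.parse(stream, handler, metadata);
                    System.out.println(file + ": " + handler.toString().length() + " chars");
                } catch (Exception e) {
                    System.err.println("Failed on " + file + ": " + e);
                }
            });
        }
        pool.shutdown();
    }
}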

As a side note, if you are handling a bunch of files from the wild in a 
production environment, I encourage separating Tika into a separate jvm vs 
tying it into any post processing – consider tika-batch and writing separate 
text files for each file processed (not so efficient, but exceedingly robust).  
If this is demo code or you know your document set well enough, you should be 
good to go with keeping Tika and your postprocessing steps in the same jvm.

From: Steven White [mailto:swhite4...@gmail.com]
Sent: Tuesday, February 09, 2016 10:35 AM
To: user@tika.apache.org
Subject: Re: Preventing OutOfMemory exception

Thanks Tim!!  You helped me find the defect in my code.

Yes, I'm using one BodyContentHandler.  When I changed my code to create a new 
BodyContentHandler for each XML file I'm parsing, I no longer see the OOM.  It 
is weird that I see this issue with XML files only.

For completeness, can you confirm if I have an issue in re-using a single 
instance of AutoDetectParser and Metadata throughout the life of my 
application?  The reason why I'm reusing a single instance is to cut down on 
overhead (I have yet to time this).

Steve


On Mon, Feb 8, 2016 at 8:33 PM, Allison, Timothy B. 
<talli...@mitre.org> wrote:
In your actual code, are you using one BodyContentHandler for all of your 
files?  Or are you creating a new BodyContentHandler for each file?  If the 
former, then, y, there’s a problem with your code; if the latter, that’s not 
something I’ve seen before.

From: Steven White [mailto:swhite4...@gmail.com]
Sent: Monday, February 08, 2016 4:56 PM
To: user@tika.apache.org
Subject: Re: Preventing OutOfMemory exception

Hi Tim,

The code I showed is a minimal example code to show the issue I'm running into, 
which is: memory keeps on growing.

In production, the loop that you see will read files off a file system and 
parse them using logic close to what I showed.  I use 
contentHandler.toString() to get back the raw text so I can save it.  Even if I 
get rid of that call, I run into OOM.

Note that, if I test the exact same code against PDF or PPT or ODP or RTF (I 
still have far more formats to test) I do *NOT* see the OOM issue even when I 
increase the loop to 1000 -- memory usage remains steady and stable.  This is 
why in my original email I asked if there is an issue with XML files or with my 
code such as if I'm missing to close / release something.

Here is the full call stack when I get the OOM:

  Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.lang.StringBuffer.ensureCapacityImpl(StringBuffer.java:338)
at java.lang.StringBuffer.append(StringBuffer.java:114)
at java.io.StringWriter.write(StringWriter.java:106)
at org.apache.tika.sax.ToTextContentHandler.characters(ToTextContentHandler.java:93)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:136)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.TextContentHandler.characters(TextContentHandler.java:55)
at org.apache.tika.sax.TeeContentHandler.characters(TeeContentHandler.java:102)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.xerces.parsers.AbstractSAXParser.characters(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.

RE: Preventing OutOfMemory exception

2016-02-09 Thread Allison, Timothy B.
Tika can fail catastrophically (permanent hangs, memory leaks, oom and other 
surprises).  These problems happen very, very rarely, and we fix problems as 
soon as we can, but really bad things can happen – see, e.g. TIKA-1132, 
TIKA-1401, SOLR-7764, PDFBOX-2200 and [0] and [1].

Tika runs in-process (in the same jvm) in Solr Cell.  The good news is that Tika works so well 
that no one has gotten around to putting it into its own jvm in Solr Cell.  I’m 
active on the Solr list and have shared potential problems with running Tika in 
the same jvm several times over there.

So, the short answer is: with the exception of TIKA-1401, I don’t _know_ of 
specific vulnerabilities that would cause serious problems with Tika.  However, 
given what we’ve seen, I have little reason to believe that these issues won’t 
happen again…very, very rarely.

I added tika-batch, which you can run from the commandline of tika-app, to 
handle these catastrophic failures.  You can also wrap your own solution via 
ForkParser or other methods.

[0] 
http://events.linuxfoundation.org/sites/events/files/slides/TikaEval_ACNA15_allison_herceg_v2.pdf
[1] http://www.slideshare.net/gagravarr/whats-new-with-apache-tika
[2] 
http://mail-archives.apache.org/mod_mbox/lucene-dev/201507.mbox/%3cjira.12843538.1436367863000.133708.1436382786...@atlassian.jira%3E
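
For anyone wrapping their own solution around ForkParser, a bare-bones sketch (the pool size and file path are placeholders); the parsing happens in forked jvms, so an OOM or crash there doesn't take down the calling application:

import java.io.File;
import java.io.InputStream;
import org.apache.tika.fork.ForkParser;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class ForkParserExample {
    public static void main(String[] args) throws Exception {
        ForkParser parser = new ForkParser(ForkParserExample.class.getClassLoader(),
                                           new AutoDetectParser());
        parser.setPoolSize(4); // number of forked parser processes
        try {
            BodyContentHandler handler = new BodyContentHandler(-1);
            Metadata metadata = new Metadata();
            try (InputStream stream = TikaInputStream.get(new File("/path/to/suspect.pdf"))) {
                parser.parse(stream, handler, metadata, new ParseContext());
            }
            System.out.println(handler.toString());
        } finally {
            parser.close(); // shut down the forked jvm(s)
        }
    }
}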

From: Steven White [mailto:swhite4...@gmail.com]
Sent: Tuesday, February 09, 2016 5:37 PM
To: user@tika.apache.org
Subject: Re: Preventing OutOfMemory exception

Thanks for the confirmation Tim.

This is a production code, so ...

I'm a bit surprised that you suggest I keep the Tika code out-of-process, as a 
standalone application, vs. using it directly from my app.  Are there known 
issues with Tika that prevent it from being used in a long running process?  Does 
Solr use Tika as an out-of-process application?  See 
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
 (I will also ask this question on the Solr mailing list).

A bit of background about my application.  I am writing a file system crawler that 
will run 24x7xN-days uninterrupted.  The application monitors the file system 
once every N minutes, where N can be anywhere from 1 minute and up, looking for new 
or updated files.  It will then send each file to Tika to extract the raw text, and 
the raw text is then sent to Solr for indexing.  My file-system-crawler will 
not be recycled or stopped unless the OS has to be restarted.  Thus, I 
expect it to run 24x7xN-days.  Finally, the file system is expected to be busy: 
on average there will be 10 new or updated files per minute.  
Overall, I'm expecting to make at least 10 calls to Tika per minute.

Steve


On Tue, Feb 9, 2016 at 12:07 PM, Allison, Timothy B. 
<talli...@mitre.org> wrote:
Same parser is ok to reuse…should even be ok in multithreaded applications.

Do not reuse ContentHandler or Metadata objects.

As a side note, if you are handling a bunch of files from the wild in a 
production environment, I encourage separating Tika into a separate jvm vs 
tying it into any post processing – consider tika-batch and writing separate 
text files for each file processed (not so efficient, but exceedingly robust).  
If this is demo code or you know your document set well enough, you should be 
good to go with keeping Tika and your postprocessing steps in the same jvm.

From: Steven White [mailto:swhite4...@gmail.com]
Sent: Tuesday, February 09, 2016 10:35 AM

To: user@tika.apache.org
Subject: Re: Preventing OutOfMemory exception

Thanks Tim!!  You helped me find the defect in my code.

Yes, I'm using one BodyContentHandler.  When I changed my code to create a new 
BodyContentHandler for each XML file I'm parsing, I no longer see the OOM.  It 
is weird that I see this issue with XML files only.

For completeness, can you confirm if I have an issue in re-using a single 
instance of AutoDetectParser and Metadata throughout the life of my 
application?  The reason why I'm reusing a single instance is to cut down on 
overhead (I have yet to time this).

Steve


On Mon, Feb 8, 2016 at 8:33 PM, Allison, Timothy B. 
<talli...@mitre.org> wrote:
In your actual code, are you using one BodyContentHandler for all of your 
files?  Or are you creating a new BodyContentHandler for each file?  If the 
former, then, y, there’s a problem with your code; if the latter, that’s not 
something I’ve seen before.

From: Steven White [mailto:swhite4...@gmail.com]
Sent: Monday, February 08, 2016 4:56 PM
To: user@tika.apache.org
Subject: Re: Preventing OutOfMemory exception

Hi Tim,

The code I showed is a minimal example code to show the issue I'm running into, 
which is: memory keeps on growing.

In production, the loop that you see will read files off a file system and 
parse them using the

RE: Using tika-app-1.11.jar

2016-02-11 Thread Allison, Timothy B.
Plan C: if you’re willing to store a mirror set of directories with the text 
versions of the files, just run tika-app.jar on your “input” directory and run 
your SolrJ loader on the “text/export” directory:

java -jar tika-app.jar  

And, if you’re feeling jsonic:

java -jar tika-app.jar -J -t -i  -o 


This method of running Tika will be robust to OOM, permanent hangs and 
OS-destroying-your-process-out-of-self-preservation incidents.
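
The SolrJ side of Plan C can be as small as walking the mirrored directory and posting each file; a sketch, assuming the output directory holds plain text (e.g., a -t run) and where the Solr URL, core name, and field names are stand-ins for your own schema:

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class TextDirLoader {
    public static void main(String[] args) throws Exception {
        Path textDir = Paths.get("/data/extracted-text");             // tika-app's output dir (assumed)
        SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mycore").build();         // URL and core name are assumptions
        try (Stream<Path> paths = Files.walk(textDir)) {
            paths.filter(Files::isRegularFile).forEach(p -> {
                try {
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", textDir.relativize(p).toString());                        // field names are assumptions
                    doc.addField("content", new String(Files.readAllBytes(p), StandardCharsets.UTF_8));
                    solr.add(doc);
                } catch (Exception e) {
                    System.err.println("Skipping " + p + ": " + e);
                }
            });
        }
        solr.commit();
        solr.close();
    }
}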


From: Steven White [mailto:swhite4...@gmail.com]
Sent: Thursday, February 11, 2016 10:18 AM
To: user@tika.apache.org
Subject: Re: Using tika-app-1.11.jar

Thank you Nick and everyone who has helped me with my questions.

I now understand Tika much better than I did last week when I first 
looked at it.

Steve

On Thu, Feb 11, 2016 at 8:18 AM, Nick Burch 
<apa...@gagravarr.org> wrote:
On Wed, 10 Feb 2016, Steven White wrote:
I'm including tika-app-1.11.jar with my application and see that Tika
includes "slf4j".

The Tika App single jar is intended for standalone use. It's not generally 
recommended to be included as part of a wider application, as it tends to 
include everything and the kitchen sink, to allow for easy standalone use

Generally, you should just tell Maven / Groovy / Ivy that you want to depend on 
Tika Core + Tika Parsers, then your build tool will fetch + bundle all the 
dependencies for you. That lets you have proper control over conflicting 
versions of jars etc

Nick



RE: How is Tika used with Solr

2016-02-11 Thread Allison, Timothy B.
x-post to Tika user's

Y and n.  If you run tika app as: 

java -jar tika-app.jar  

It runs tika-batch under the hood (TIKA-1330 as part of TIKA-1302).  This 
creates a parent and a child process; if the child process notices a hung thread, 
it dies and the parent restarts it.  The same happens if your OS gets upset with the child 
process and kills it out of self preservation, or if there's an OOM.  You can also 
configure how often the child shuts itself 
down (with parental restarting) to mitigate memory leaks.

So, y, if your use case allows  , then we now have that 
in Tika.

I've been wanting to add a similar watchdog to tika-server ... any interest in 
that?


-Original Message-
From: xavi jmlucjav [mailto:jmluc...@gmail.com] 
Sent: Thursday, February 11, 2016 2:16 PM
To: solr-user 
Subject: Re: How is Tika used with Solr

I have found that when you deal with large amounts of all sorts of files, in the 
end you find stuff (pdfs are typically nasty) that will hang tika. That is even 
worse than a crash or OOM.
We used aperture instead of tika because at the time it provided a watchdog 
feature to kill what seemed like a hung extracting thread. That feature is 
super important for a robust text extracting pipeline. Has Tika gained such 
a feature already?

xavier

On Wed, Feb 10, 2016 at 6:37 PM, Erick Erickson 
wrote:

> Timothy's points are absolutely spot-on. In production scenarios, if 
> you use the simple "run Tika in a SolrJ program" approach you _must_ 
> abort the program on OOM errors and the like and  figure out what's 
> going on with the offending document(s). Or record the name somewhere 
> and skip it next time 'round. Or
>
> How much you have to build in here really depends on your use case.
> For "small enough"
> sets of documents or one-time indexing, you can get by with dealing 
> with errors one at a time.
> For robust systems where you have to have indexing available at all 
> times and _especially_ where you don't control the document corpus, 
> you have to build something far more tolerant as per Tim's comments.
>
> FWIW,
> Erick
>
> On Wed, Feb 10, 2016 at 4:27 AM, Allison, Timothy B. 
> 
> wrote:
> > I completely agree on the impulse, and for the vast majority of the 
> > time
> (regular catchable exceptions), that'll work.  And, by vast majority, 
> aside from oom on very large files, we aren't seeing these problems 
> any more in our 3 million doc corpus (y, I know, small by today's 
> standards) from
> govdocs1 and Common Crawl over on our Rackspace vm.
> >
> > Given my focus on Tika, I'm overly sensitive to the worst case
> scenarios.  I find it encouraging, Erick, that you haven't seen these 
> types of problems, that users aren't complaining too often about 
> catastrophic failures of Tika within Solr Cell, and that this thread 
> is not yet swamped with integrators agreeing with me. :)
> >
> > However, because oom can leave memory in a corrupted state (right?),
> because you can't actually kill a thread for a permanent hang and 
> because Tika is a kitchen sink and we can't prevent memory leaks in 
> our dependencies, one needs to be aware that bad things can 
> happen...if only very, very rarely.  For a fellow traveler who has run 
> into these issues on massive data sets, see also [0].
> >
> > Configuring Hadoop to work around these types of problems is not too
> difficult -- it has to be done with some thought, though.  On 
> conventional single box setups, the ForkParser within Tika is one 
> option, tika-batch is another.  Hand rolling your own parent/child 
> process is non-trivial and is not necessary for the vast majority of use 
> cases.
> >
> >
> > [0]
> http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-w
> eb-content-nanite/
> >
> >
> >
> > -Original Message-
> > From: Erick Erickson [mailto:erickerick...@gmail.com]
> > Sent: Tuesday, February 09, 2016 10:05 PM
> > To: solr-user 
> > Subject: Re: How is Tika used with Solr
> >
> > My impulse would be to _not_ run Tika in its own JVM, just catch any
> exceptions in my code and "do the right thing". I'm not sure I see any 
> real benefit in yet another JVM.
> >
> > FWIW,
> > Erick
> >
> > On Tue, Feb 9, 2016 at 6:22 PM, Allison, Timothy B. 
> > 
> wrote:
> >> I have one answer here [0], but I'd be interested to hear what Solr
> users/devs/integrators have experienced on this topic.
> >>
> >> [0]
> >> http://mail-archives.apache.org/mod_mbox/tika-user/201602.mbox/%3CC
> >> Y1P 
> >> R09MB079

RE: How is Tika used with Solr

2016-02-11 Thread Allison, Timothy B.
Right.  If you can't dump to a mirrored output directory, then you'll have to 
do your own monitoring.

If you can dump to a mirrored output directory, then tika-app will do all of 
the watchdog stuff for you.

If you can't, then, y, you're on your own.

If you want to get fancy, you could try implementing FileResourceConsumer in 
tika-batch.  Look at FSFileResourceConsumer as an example.  I've done this for 
reading Tika output and indexing w/ Lucene.

You might also look at StrawmanTikaAppDriver in the tika-batch module for an 
example of some basic multithreaded code that does what you suggest below.
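
If you do roll your own around Runtime.getRuntime().exec()/ProcessBuilder, the watchdog part can be as small as a timed wait plus a forcible kill; a sketch, with a made-up timeout and made-up paths:

import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.concurrent.TimeUnit;

public class TikaAppWatchdog {
    public static void main(String[] args) throws Exception {
        File input = new File("/path/to/input.pdf");    // placeholder
        File textOut = new File("/tmp/input.txt");       // placeholder

        ProcessBuilder pb = new ProcessBuilder(
                "java", "-jar", "tika-app.jar", "-t", input.getAbsolutePath());
        pb.redirectOutput(textOut);                       // capture stdout (the extracted text)
        pb.redirectErrorStream(false);

        Process child = pb.start();
        // Give the child a fixed budget; kill it if it hangs.
        if (!child.waitFor(120, TimeUnit.SECONDS)) {
            child.destroyForcibly();
            System.err.println("Timed out on " + input + "; skipping");
            return;
        }
        if (child.exitValue() != 0) {
            System.err.println("tika-app exited with " + child.exitValue() + " on " + input);
            return;
        }
        String text = new String(Files.readAllBytes(textOut.toPath()), StandardCharsets.UTF_8);
        System.out.println("Extracted " + text.length() + " chars from " + input);
    }
}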

-Original Message-
From: Steven White [mailto:swhite4...@gmail.com] 
Sent: Thursday, February 11, 2016 6:03 PM
To: solr-u...@lucene.apache.org
Subject: Re: How is Tika used with Solr

Tim,

In my case, I have to use Tika as follows:

java -jar tika-app.jar -t 

I will be invoking the above command from my Java app using 
Runtime.getRuntime().exec().  I will capture stdout and stderr to get back the 
raw text I need.  My app use case will not allow me to use a  
; that is out of the question.

Reading your summary, it looks like I won't get this watch-dog monitoring and 
thus I have to implement my own.  Can you confirm?

Thanks

Steve


On Thu, Feb 11, 2016 at 2:45 PM, Allison, Timothy B. 
wrote:

> x-post to Tika user's
>
> Y and n.  If you run tika app as:
>
> java -jar tika-app.jar  
>
> It runs tika-batch under the hood (TIKA-1330 as part of TIKA-1302).  
> This creates a parent and child process, if the child process notices 
> a hung thread, it dies, and the parent restarts it.  Or if your OS 
> gets upset with the child process and kills it out of self 
> preservation, the parent restarts the child, or if there's an 
> OOM...and you can configure how often the child shuts itself down 
> (with parental restarting) to mitigate memory leaks.
>
> So, y, if your use case allows  , then we now 
> have that in Tika.
>
> I've been wanting to add a similar watchdog to tika-server ... any 
> interest in that?
>
>
> -Original Message-
> From: xavi jmlucjav [mailto:jmluc...@gmail.com]
> Sent: Thursday, February 11, 2016 2:16 PM
> To: solr-user 
> Subject: Re: How is Tika used with Solr
>
> I have found that when you deal with large amounts of all sorts of 
> files, in the end you find stuff (pdfs are typically nasty) that will hang 
> tika.
> That is even worse than a crash or OOM.
> We used aperture instead of tika because at the time it provided a 
> watchdog feature to kill what seemed like a hung extracting thread. 
> That feature is super important for a robust text extracting pipeline. 
> Has Tika gained such a feature already?
>
> xavier
>
> On Wed, Feb 10, 2016 at 6:37 PM, Erick Erickson 
> 
> wrote:
>
> > Timothy's points are absolutely spot-on. In production scenarios, if 
> > you use the simple "run Tika in a SolrJ program" approach you _must_ 
> > abort the program on OOM errors and the like and  figure out what's 
> > going on with the offending document(s). Or record the name 
> > somewhere and skip it next time 'round. Or
> >
> > How much you have to build in here really depends on your use case.
> > For "small enough"
> > sets of documents or one-time indexing, you can get by with dealing 
> > with errors one at a time.
> > For robust systems where you have to have indexing available at all 
> > times and _especially_ where you don't control the document corpus, 
> > you have to build something far more tolerant as per Tim's comments.
> >
> > FWIW,
> > Erick
> >
> > On Wed, Feb 10, 2016 at 4:27 AM, Allison, Timothy B.
> > 
> > wrote:
> > > I completely agree on the impulse, and for the vast majority of 
> > > the time
> > (regular catchable exceptions), that'll work.  And, by vast 
> > majority, aside from oom on very large files, we aren't seeing these 
> > problems any more in our 3 million doc corpus (y, I know, small by 
> > today's
> > standards) from
> > govdocs1 and Common Crawl over on our Rackspace vm.
> > >
> > > Given my focus on Tika, I'm overly sensitive to the worst case
> > scenarios.  I find it encouraging, Erick, that you haven't seen 
> > these types of problems, that users aren't complaining too often 
> > about catastrophic failures of Tika within Solr Cell, and that this 
> > thread is not yet swamped with integrators agreeing with me. :)
> > >
> > > However, because oom can leave memory in a corrupted state 
> > > (right?),
> > because you can't actually k
