1)      Right, the NPE is caused by the exception's getMessage() returning 
null.  In TIKA-1605, we modified all code in the project to check for 
null returned by getMessage().  So, in the "fixed" version, you'll still get 
your good old IOException.  I can't tell from your stacktrace what caused the 
IOException.

2)      Yes, regular builds of 1.9's app (and other modules) are available via 
Jenkins here: 
https://builds.apache.org/view/Tika/job/tika-trunk-jdk1.7/org.apache.tika$tika-app/

3)      Ok, makes sense.
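
As a rough illustration of point 1, the sketch below shows how an IOException created without a message returns null from getMessage(), and how a null check avoids the NPE. The class and method names are hypothetical, and this is not the actual TIKA-1605 patch; only the marker string is taken from the PDFParser snippet quoted further down in this thread.

```java
import java.io.IOException;

public class NullSafeMessageCheck {
    // Marker string copied from the PDFParser snippet quoted in this thread.
    static final String BAD_PASSWORD_MARKER = "Error (CryptographyException)";

    // Hypothetical helper: true only if the exception carries a message
    // that contains the marker. A null message simply yields false.
    static boolean isBadPassword(IOException e) {
        String msg = e.getMessage(); // null when no message was supplied
        return msg != null && msg.contains(BAD_PASSWORD_MARKER);
    }

    public static void main(String[] args) {
        IOException noMessage = new IOException();
        IOException badPassword =
                new IOException("Error (CryptographyException): bad password");
        System.out.println(isBadPassword(noMessage));
        System.out.println(isBadPassword(badPassword));
    }
}
```

With a check like this, an IOException with no message falls through to whatever handling follows instead of failing inside the catch block.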

For kicks, you may want to change opening the file to:
is = TikaInputStream.get(file)
or maybe:
is = TikaInputStream.get(file, metadata)

And you'll want to wrap the closing of the InputStream in a try/catch block.  Or 
use IOUtils.closeQuietly.
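
A minimal sketch of the close-handling suggestion, using only the JDK so it runs without Tika on the classpath. The closeQuietly method here is a hand-written stand-in for the IOUtils helper mentioned above, and QuietClose and readFirstByte are hypothetical names; in the real code the stream would come from TikaInputStream.get(file) and be handed to the parser instead of being read directly.

```java
import java.io.ByteArrayInputStream;
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;

public class QuietClose {
    // Closes the resource and swallows any IOException, so a failing close()
    // in a finally block cannot mask an exception thrown by the parse itself.
    static void closeQuietly(Closeable c) {
        if (c == null) {
            return;
        }
        try {
            c.close();
        } catch (IOException ignored) {
            // deliberately ignored
        }
    }

    // Stand-in for the parse call: reads one byte, then always closes the stream.
    static int readFirstByte(InputStream is) throws IOException {
        try {
            return is.read();
        } finally {
            closeQuietly(is);
        }
    }

    public static void main(String[] args) throws IOException {
        InputStream is = new ByteArrayInputStream(new byte[] {42});
        System.out.println(readFirstByte(is));
    }
}
```

The point of the design is that the finally block itself can no longer throw, which matters when the enclosing method does not declare IOException.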

Finally, are you able to share the particular file that caused the IOException?

From: Mouthgalya Ganapathy [mailto:mouthgalya.ganapa...@fitchratings.com]
Sent: Thursday, June 04, 2015 10:20 AM
To: Allison, Timothy B.; talli...@apache.org
Cc: user@tika.apache.org; Sauparna Sarkar
Subject: RE: Memory issues with PDF parser

Hi Timothy,
Thanks for the prompt reply.


1.)    Wouldn't fixing the null pointer exception in turn throw the IO 
exception? I saw that the null pointer exception was thrown inside the catch 
block of the IO exception. Is there a known root cause for the IO exception?

Is that also fixed?



I am including the code that threw the null pointer exception in Tika 1.8



Exception:
10:53:12,218 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129)) java.lang.NullPointerException
10:53:12,219 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:158)



Code in the pdf parser:
catch (IOException e) {
            //nonseq parser throws IOException for bad password
            //At the Tika level, we want the same exception to be thrown
            if (e.getMessage().contains("Error (CryptographyException)")) {
                metadata.set("pdf:encrypted", Boolean.toString(true));
                throw new EncryptedDocumentException(e);
            }


2.)    Do you have a snapshot or beta version of Tika 1.9 that I could try with 
our pdf corpus? It would also help in your developer testing.

3.)    For the inline images, we have just kept the defaults (which is to skip 
them, as you mentioned). I have not done any memory profiling yet. I 
will also try that.



Thanks,
MG

From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Thursday, June 04, 2015 7:19 AM
To: Mouthgalya Ganapathy; talli...@apache.org<mailto:talli...@apache.org>
Cc: user@tika.apache.org<mailto:user@tika.apache.org>
Subject: RE: Memory issues with PDF parser

Hi Mouthgalya,
  We fixed that NPE in https://issues.apache.org/jira/browse/TIKA-1605, and the 
fix will be available in Tika 1.9, which should be out within a week.
  As for memory issues, we worked around a memory leak in PDFBox with static 
caching of fonts for Tika 1.7 (may have been 1.8), but there may be others.  
One potential memory hog is the processing of inline images within PDFs...have 
you configured Tika to pull those out (default is to skip them)?  Other than 
that, I'd recommend dropping a note to the PDFBox users list to get help in 
diagnosing memory consumption with PDFBox.  Have you tried any memory profiling?

          Best,

                    Tim

From: Mouthgalya Ganapathy [mailto:mouthgalya.ganapa...@fitchratings.com]
Sent: Wednesday, June 03, 2015 3:25 PM
To: talli...@apache.org<mailto:talli...@apache.org>
Subject: Memory issues with PDF parser

Hi all,
I am trying to use Apache Tika 1.8 to extract content from PDFs. I have the 
code below for extracting it. It works well for a few files, but if I read many 
files, I see an out-of-memory exception.
I also see a null pointer exception in the PDF parser. I think the null pointer 
exception is caused by the memory exception.
Any suggestions?

Tika version:
  <dependency>
                     <groupId>org.apache.tika</groupId>
                     <artifactId>tika-server</artifactId>
                     <version>1.8</version>
        </dependency>

I am running it as a part of J2EE APP in JBoss 1.7

Code:-

//Parse the pdf content using Apache Tika
            InputStream is = null;
            try {
              is = new BufferedInputStream(new FileInputStream(input));
              //Disable write limit.
              contenthandler = new BodyContentHandler(-1);
               metadata = new Metadata();
              pdfparser = new PDFParser();
              context = new ParseContext();
              pdfparser.parse(is, contenthandler, metadata, context);
              docBody=contenthandler.toString();
              //System.out.println(contenthandler.toString());
            }
            catch (Exception e) {
               System.out.println("Exception in updating docbody for report ==> " + report.getDocID());
               if(is==null)
                 System.out.println("The input stream is a null object");
               e.printStackTrace();
              logger.log(Level.SEVERE, e.getMessage(), e);
            }
            finally {
                if (is != null) is.close();
                contenthandler=null;
                metadata=null;
                pdfparser=null;
                context =null;
            }


Exception:-
I am just including the null pointer exception in the parser below.

10:53:11,696 INFO  [stdout] (Thread-11 
(HornetQ-client-global-threads-1619682129)) Exception in updating docbody for 
report ==> RPT_764268
10:53:12,218 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129)) java.lang.NullPointerException
10:53:12,219 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:158)
10:53:12,219 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
com.fitch.researchapi.dao.ResearchReportMDAO.updateDocBody(ResearchReportMDAO.java:881)
10:53:12,219 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
com.fitch.researchapi.dao.ResearchReportMDAO.loadFile_NEW(ResearchReportMDAO.java:965)
10:53:12,220 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
com.fitch.researchapi.dao.ResearchReportMDAO.upsert_NEW(ResearchReportMDAO.java:676)
10:53:12,220 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
com.fitch.research.ejb.ResearchReportManagerBean.processResearchReport(ResearchReportManagerBean.java:70)
10:53:12,221 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
sun.reflect.GeneratedMethodAccessor35.invoke(Unknown Source)
10:53:12,221 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
10:53:12,222 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
java.lang.reflect.Method.invoke(Method.java:597)
10:53:12,222 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.as.ee.component.ManagedReferenceMethodInterceptorFactory$ManagedReferenceMethodInterceptor.processInvocation(ManagedReferenceMethodInterceptorFactory.java:72)
10:53:12,223 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)
10:53:12,223 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.invocation.WeavedInterceptor.processInvocation(WeavedInterceptor.java:53)
10:53:12,223 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.as.ee.component.interceptors.UserInterceptorFactory$1.processInvocation(UserInterceptorFactory.java:36)
10:53:12,224 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)
10:53:12,224 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.as.jpa.interceptor.SBInvocationInterceptor.processInvocation(SBInvocationInterceptor.java:47)
10:53:12,225 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)
10:53:12,225 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.invocation.InitialInterceptor.processInvocation(InitialInterceptor.java:21)
10:53:12,226 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)
10:53:12,226 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61)
10:53:12,227 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.as.ee.component.interceptors.ComponentDispatcherInterceptor.processInvocation(ComponentDispatcherInterceptor.java:53)
10:53:12,227 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)
10:53:12,228 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.as.ejb3.component.pool.PooledInstanceInterceptor.processInvocation(PooledInstanceInterceptor.java:51)
10:53:12,228 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)
10:53:12,229 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.as.ejb3.tx.CMTTxInterceptor.invokeInCallerTx(CMTTxInterceptor.java:202)
10:53:12,229 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.as.ejb3.tx.CMTTxInterceptor.required(CMTTxInterceptor.java:306)
10:53:12,229 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.as.ejb3.tx.CMTTxInterceptor.processInvocation(CMTTxInterceptor.java:190)
10:53:12,230 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)
10:53:12,230 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.as.ejb3.component.interceptors.CurrentInvocationContextInterceptor.processInvocation(CurrentInvocationContextInterceptor.java:41)
10:53:12,231 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)
10:53:12,231 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.as.ejb3.component.interceptors.LoggingInterceptor.processInvocation(LoggingInterceptor.java:59)
10:53:12,231 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)
10:53:12,232 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.as.ee.component.NamespaceContextInterceptor.processInvocation(NamespaceContextInterceptor.java:50)
10:53:12,232 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)
10:53:12,233 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.as.ejb3.component.interceptors.AdditionalSetupInterceptor.processInvocation(AdditionalSetupInterceptor.java:32)
10:53:12,233 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)
10:53:12,233 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.as.ee.component.TCCLInterceptor.processInvocation(TCCLInterceptor.java:45)
10:53:12,234 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)
10:53:12,234 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61)
10:53:12,235 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.as.ee.component.ViewService$View.invoke(ViewService.java:165)
10:53:12,235 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.as.ee.component.ViewDescription$1.processInvocation(ViewDescription.java:173)
10:53:12,235 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)
10:53:12,236 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61)
10:53:12,236 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.as.ee.component.ProxyInvocationHandler.invoke(ProxyInvocationHandler.java:72)
10:53:12,236 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
com.fitch.research.ejb.ResearchReportManagerBeanLocal$$$view4.processResearchReport(Unknown Source)
10:53:12,868 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
com.fitch.research.ejb.mdb.ResearchQueueManagerMDB.onMessage(ResearchQueueManagerMDB.java:150)
10:53:12,868 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
sun.reflect.GeneratedMethodAccessor34.invoke(Unknown Source)
10:53:12,869 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
10:53:12,869 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
java.lang.reflect.Method.invoke(Method.java:597)
10:53:12,870 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.as.ee.component.ManagedReferenceMethodInterceptorFactory$ManagedReferenceMethodInterceptor.processInvocation(ManagedReferenceMethodInterceptorFactory.java:72)
10:53:12,870 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)
10:53:12,871 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.invocation.WeavedInterceptor.processInvocation(WeavedInterceptor.java:53)
10:53:12,871 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.as.ee.component.interceptors.UserInterceptorFactory$1.processInvocation(UserInterceptorFactory.java:36)
10:53:12,872 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)
10:53:12,872 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.invocation.InitialInterceptor.processInvocation(InitialInterceptor.java:21)
10:53:12,872 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)
10:53:12,873 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61)
10:53:12,873 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.as.ee.component.interceptors.ComponentDispatcherInterceptor.processInvocation(ComponentDispatcherInterceptor.java:53)
10:53:12,874 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)
10:53:12,874 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.as.ejb3.component.pool.PooledInstanceInterceptor.processInvocation(PooledInstanceInterceptor.java:51)
10:53:12,874 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)
10:53:12,875 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.as.ejb3.tx.CMTTxInterceptor.invokeInCallerTx(CMTTxInterceptor.java:202)
10:53:12,875 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.as.ejb3.tx.CMTTxInterceptor.required(CMTTxInterceptor.java:306)
10:53:12,876 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.as.ejb3.tx.CMTTxInterceptor.processInvocation(CMTTxInterceptor.java:190)
10:53:12,876 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)
10:53:12,876 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.as.ejb3.component.interceptors.CurrentInvocationContextInterceptor.processInvocation(CurrentInvocationContextInterceptor.java:41)
10:53:12,877 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)
10:53:12,877 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.as.ejb3.component.interceptors.LoggingInterceptor.processInvocation(LoggingInterceptor.java:59)
10:53:12,878 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)
10:53:12,878 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.as.ee.component.NamespaceContextInterceptor.processInvocation(NamespaceContextInterceptor.java:50)
10:53:12,878 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)
10:53:12,879 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.as.ejb3.component.interceptors.AdditionalSetupInterceptor.processInvocation(AdditionalSetupInterceptor.java:43)
10:53:12,879 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)
10:53:12,880 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.as.ejb3.component.messagedriven.MessageDrivenComponentDescription$5$1.processInvocation(MessageDrivenComponentDescription.java:184)
10:53:12,880 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)
10:53:12,881 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.as.ee.component.TCCLInterceptor.processInvocation(TCCLInterceptor.java:45)
10:53:12,881 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)
10:53:12,881 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61)
10:53:12,882 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.as.ee.component.ViewService$View.invoke(ViewService.java:165)
10:53:12,883 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.as.ee.component.ViewDescription$1.processInvocation(ViewDescription.java:173)
10:53:12,883 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))    at 
org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)

Thanks,
MG
Product Development Team





