Marc Prud'hommeaux created TIKA-2564:
----------------------------------------

             Summary: Tika client cannot extract files from embedded archive formats
                 Key: TIKA-2564
                 URL: https://issues.apache.org/jira/browse/TIKA-2564
             Project: Tika
          Issue Type: Bug
         Environment: Mac OS 10.13.3 (17D47)

17:42 ext$ java -version
java version "9.0.1"
Java(TM) SE Runtime Environment (build 9.0.1+11)
Java HotSpot(TM) 64-Bit Server VM (build 9.0.1+11, mixed mode)

17:42 ext$ uname -a
Darwin bix.local 17.4.0 Darwin Kernel Version 17.4.0: Sun Dec 17 09:19:54 PST 2017; root:xnu-4570.41.2~1/RELEASE_X86_64 x86_64

            Reporter: Marc Prud'hommeaux


 

This may be related to TIKA-2395. When I try to extract the files from
tika/tika-parsers/src/test/resources/test-documents/test-documents.tgz with:

% coursier launch org.apache.tika:tika-app:1.17 --main org.apache.tika.cli.TikaCLI -- --extract test-documents.tgz

I see the following exception:

Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pkg.CompressorParser@62628e78
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:205)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:486)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:145)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:564)
    at coursier.cli.qR.a(Unknown Source)
    at coursier.cli.qQ.j(Unknown Source)
    at coursier.cli.qW.a(Unknown Source)
    at d.h.a.c(Unknown Source)
    at b.b.c_(Unknown Source)
    at d.b.d.E.g(Unknown Source)
    at d.b.e.aW.g(Unknown Source)
    at d.b.f.b.aa.a(Unknown Source)
    at coursier.cli.qQ.b(Unknown Source)
    at coursier.cli.Q.b(Unknown Source)
    at b.J.c_(Unknown Source)
    at d.F.h(Unknown Source)
    at b.F.a(Unknown Source)
    at coursier.cli.Coursier.main(Unknown Source)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:564)
    at coursier.Bootstrap.main(Bootstrap.java:428)
Caused by: java.io.IOException: mark/reset not supported
    at java.base/java.io.InputStream.reset(InputStream.java:474)
    at org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:444)
    at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:84)
    at org.apache.tika.cli.TikaCLI$FileEmbeddedDocumentExtractor.parseEmbedded(TikaCLI.java:1045)
    at org.apache.tika.parser.pkg.CompressorParser.parse(CompressorParser.java:222)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    ... 28 more
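
The "Caused by" frame is the base java.io.InputStream.reset(), which suggests the stream handed to the detector did not override mark/reset. The sketch below reproduces that failure using only the JDK and shows that wrapping the stream in a BufferedInputStream avoids it. MarkResetDemo, plainStream and peek are hypothetical names for illustration; this is not Tika code, and it is only a guess at what the detector does with the stream.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class MarkResetDemo {

    /** A stream that only implements read(), so it inherits the base mark/reset behaviour. */
    static InputStream plainStream(byte[] data) {
        ByteArrayInputStream source = new ByteArrayInputStream(data);
        return new InputStream() {
            @Override
            public int read() {
                return source.read();
            }
        };
    }

    /** Reads up to n header bytes and rewinds, the way a type detector typically peeks. */
    static byte[] peek(InputStream in, int n) throws IOException {
        in.mark(n);
        byte[] header = new byte[n];
        int count = in.readNBytes(header, 0, n); // Java 9+
        in.reset(); // throws if the stream does not support mark/reset
        return java.util.Arrays.copyOf(header, count);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0, 1, 2, 3, 4};

        InputStream raw = plainStream(data);
        System.out.println("raw markSupported: " + raw.markSupported());
        try {
            peek(raw, 4);
        } catch (IOException e) {
            System.out.println("raw stream failed: " + e.getMessage());
        }

        // Wrapping adds mark/reset support, so the same peek succeeds.
        InputStream buffered = new BufferedInputStream(plainStream(data));
        byte[] header = peek(buffered, 4);
        System.out.println("peeked " + header.length + " bytes, stream rewound");
    }
}
```
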

 

However, I can browse the document fine using:

 

% coursier launch org.apache.tika:tika-app:1.17 --main org.apache.tika.cli.TikaCLI -- test-documents.tgz

 

This issue affects: test-documents.rar, test-documents.tar.Z, 
test-documents.tbz2, and test-documents.tgz

But it does not affect test-documents.7z, test-documents.cab, 
test-documents.ddf, test-documents.dmg, test-documents.tar, or 
test-documents.zip

 

 

This makes me suspect that the problem lies in extracting files from packages
that are embedded inside another archive or compression format (for example,
a tar inside gzip).
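
To illustrate why a nested stream might behave differently from a file on disk, here is a minimal sketch using the JDK's GZIPInputStream as a stand-in for the decompressed inner stream. Tika itself uses other stream classes, which may behave differently, so treat this as an assumption rather than a diagnosis. NestedStreamDemo, gzip and ensureMarkSupported are hypothetical names.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class NestedStreamDemo {

    /** Compresses a payload in memory so the demo needs no files. */
    static byte[] gzip(byte[] payload) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(payload);
        }
        return out.toByteArray();
    }

    /** Hypothetical guard: wrap the stream only when it cannot mark/reset by itself. */
    static InputStream ensureMarkSupported(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    public static void main(String[] args) throws IOException {
        byte[] inner = "stand-in for the inner archive".getBytes(StandardCharsets.UTF_8);

        // The decompressed view of the outer container, as an embedded-document handler would see it.
        InputStream decompressed = new GZIPInputStream(new ByteArrayInputStream(gzip(inner)));
        System.out.println("decompressed markSupported: " + decompressed.markSupported());

        // With the guard in place, a detector can peek at the header and rewind.
        InputStream safe = ensureMarkSupported(decompressed);
        safe.mark(8);
        byte[] header = new byte[8];
        int count = safe.readNBytes(header, 0, 8); // Java 9+
        safe.reset();
        System.out.println("peeked " + count + " bytes, then rewound");
    }
}
```

If the real cause is similar, a guard of this kind in front of the detector call would be one possible workaround, at the cost of an extra buffer per embedded document.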

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
